2025-05-07T20:22:35.2725960Z Current runner version: '2.323.0'
2025-05-07T20:22:35.2732574Z Runner name: 'i-0c2643f2bcfaf5e6b'
2025-05-07T20:22:35.2733572Z Machine name: 'ip-10-0-1-116'
2025-05-07T20:22:35.2736373Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:35.2738729Z Contents: read
2025-05-07T20:22:35.2739245Z Metadata: read
2025-05-07T20:22:35.2739730Z Packages: read
2025-05-07T20:22:35.2740327Z ##[endgroup]
2025-05-07T20:22:35.2742682Z Secret source: None
2025-05-07T20:22:35.2743741Z Prepare workflow directory
2025-05-07T20:22:35.3263071Z Prepare all required actions
2025-05-07T20:22:35.3299673Z Getting action download info
2025-05-07T20:22:35.5620465Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.8397091Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.2004470Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.8107259Z Getting action download info
2025-05-07T20:22:37.9011455Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:38.1343941Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.10, 12.6.3, 12.6.3, gcc)
2025-05-07T20:22:38.1960699Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:38.2095422Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:38.2108451Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:38.2109983Z ##[endgroup]
2025-05-07T20:22:39.3939490Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.3940082Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.3940342Z AMI Name: unknown
2025-05-07T20:22:39.3981344Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.7477913Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.7478222Z with:
2025-05-07T20:22:44.7478449Z submodules: true
2025-05-07T20:22:44.7478687Z repository: pytorch/FBGEMM
2025-05-07T20:22:44.7479090Z token: ***
2025-05-07T20:22:44.7479292Z ssh-strict: true
2025-05-07T20:22:44.7479503Z ssh-user: git
2025-05-07T20:22:44.7479727Z persist-credentials: true
2025-05-07T20:22:44.7479975Z clean: true
2025-05-07T20:22:44.7480201Z sparse-checkout-cone-mode: true
2025-05-07T20:22:44.7480471Z fetch-depth: 1
2025-05-07T20:22:44.7480686Z fetch-tags: false
2025-05-07T20:22:44.7480899Z show-progress: true
2025-05-07T20:22:44.7481122Z lfs: false
2025-05-07T20:22:44.7481324Z set-safe-directory: true
2025-05-07T20:22:44.7481575Z env:
2025-05-07T20:22:44.7481781Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.7482087Z BUILD_ENV: build_binary
2025-05-07T20:22:44.7482367Z BUILD_TARGET: genai
2025-05-07T20:22:44.7482587Z BUILD_VARIANT: cuda
2025-05-07T20:22:44.7482851Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.7483101Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.7483339Z ##[endgroup]
2025-05-07T20:22:44.8636749Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.8637939Z ##[group]Getting Git version info
2025-05-07T20:22:44.8638378Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8638988Z [command]/usr/bin/git version
2025-05-07T20:22:44.8639250Z git version 2.47.1
2025-05-07T20:22:44.8645277Z ##[endgroup]
2025-05-07T20:22:44.8668146Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/af869ebb-95fa-41ed-9d48-4e5f3a9a72b2' before making global git config changes
2025-05-07T20:22:44.8669056Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.8673066Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8710264Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8713003Z ##[group]Initializing the repository
2025-05-07T20:22:44.8717125Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8760266Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.8760913Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.8761449Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.8761824Z hint:
2025-05-07T20:22:44.8762112Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.8762442Z hint:
2025-05-07T20:22:44.8762763Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.8763307Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.8763715Z hint:
2025-05-07T20:22:44.8763942Z hint:   git branch -m <name>
2025-05-07T20:22:44.8764460Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.8774338Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.8809022Z ##[endgroup]
2025-05-07T20:22:44.8809737Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.8813772Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.8845843Z ##[endgroup]
2025-05-07T20:22:44.8846439Z ##[group]Setting up auth
2025-05-07T20:22:44.8853271Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.8886044Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.9251909Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.9284795Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.9635830Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.9686944Z ##[endgroup]
2025-05-07T20:22:44.9687622Z ##[group]Fetching the repository
2025-05-07T20:22:44.9696505Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3163001Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3163529Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3188068Z ##[endgroup]
2025-05-07T20:22:45.3188457Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3191321Z ##[endgroup]
2025-05-07T20:22:45.3195873Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.3230386Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
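The fetch above never embeds credentials in the remote URL: checkout@v4 injects a basic-auth header through http.<url>.extraheader and pulls the PR's synthetic merge ref directly. A minimal sketch of the same pattern outside Actions, assuming a GITHUB_TOKEN variable in the environment (the basic credential is "x-access-token:<token>"):

  # Sketch: replicate checkout@v4's authenticated shallow fetch of a PR merge ref.
  # GITHUB_TOKEN is an assumption here, not something this log provides.
  AUTH=$(printf 'x-access-token:%s' "$GITHUB_TOKEN" | base64 | tr -d '\n')
  git init FBGEMM && cd FBGEMM
  git remote add origin https://github.com/pytorch/FBGEMM
  git -c http.https://github.com/.extraheader="AUTHORIZATION: basic $AUTH" \
      -c protocol.version=2 \
      fetch --no-tags --prune --no-recurse-submodules --depth=1 \
      origin +refs/pull/4066/merge:refs/remotes/pull/4066/merge
  git checkout --progress --force refs/remotes/pull/4066/merge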
2025-05-07T20:22:45.3257718Z ##[group]Checking out the ref
2025-05-07T20:22:45.3261709Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.4354720Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.4355048Z
2025-05-07T20:22:45.4355297Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.4355927Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.4356435Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.4356742Z
2025-05-07T20:22:45.4356954Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.4357432Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.4357701Z
2025-05-07T20:22:45.4357823Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.4358011Z
2025-05-07T20:22:45.4358139Z Or undo this operation with:
2025-05-07T20:22:45.4358313Z
2025-05-07T20:22:45.4358408Z   git switch -
2025-05-07T20:22:45.4358885Z
2025-05-07T20:22:45.4359114Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.4359444Z
2025-05-07T20:22:45.4359824Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.4368449Z ##[endgroup]
2025-05-07T20:22:45.4368852Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.4373874Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.4423474Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.4454945Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.4486488Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.4515706Z ##[endgroup]
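The two insteadOf rules just added rewrite SSH-style remotes to HTTPS, so the extraheader token also covers any submodule pinned to a git@github.com URL. A standalone sketch of the mechanism:

  # Sketch: route SSH-style GitHub URLs over HTTPS so header-based auth applies.
  git config --global --add url.https://github.com/.insteadOf git@github.com:
  # After this, a submodule remote like git@github.com:asmjit/asmjit.git is
  # fetched as https://github.com/asmjit/asmjit.git, picking up the
  # http.https://github.com/.extraheader credential configured above.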
2025-05-07T20:22:45.4516094Z ##[group]Fetching submodules
2025-05-07T20:22:45.4518424Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.4862760Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5193567Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5195757Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5198160Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5201475Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5204968Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5208725Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5211933Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5242743Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.8854804Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.3693432Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.8146500Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.9688167Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.2280562Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.5195539Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.7025151Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.7025755Z  * branch            e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.7497751Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.3724727Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.3725205Z  * branch            4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.6530431Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.2691514Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.2692001Z  * branch            6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.3693447Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.4916542Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.4917061Z  * branch            3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.1909475Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.1331104Z From https://github.com/google/googletest
2025-05-07T20:22:54.1331562Z  * branch            f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.1739790Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:54.8851652Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:54.8852592Z  * branch            420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:54.8935285Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.6040030Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.6040693Z  * branch            9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.7145147Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:55.7165089Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.7500114Z Entering 'external/asmjit'
2025-05-07T20:22:55.7532544Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.7564394Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.7596828Z Entering 'external/cutlass'
2025-05-07T20:22:55.7628677Z Entering 'external/googletest'
2025-05-07T20:22:55.7660216Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.7692639Z Entering 'external/json'
2025-05-07T20:22:55.7735747Z ##[endgroup]
2025-05-07T20:22:55.7736137Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.7742760Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.8073798Z Entering 'external/asmjit'
2025-05-07T20:22:55.8140079Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8216235Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8282840Z Entering 'external/cutlass'
2025-05-07T20:22:55.8362557Z Entering 'external/googletest'
2025-05-07T20:22:55.8429984Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8495897Z Entering 'external/json'
2025-05-07T20:22:55.8580406Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:55.8906604Z Entering 'external/asmjit'
2025-05-07T20:22:55.8968105Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:55.8970602Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9031594Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:55.9034547Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9095695Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:55.9098803Z Entering 'external/cutlass'
2025-05-07T20:22:55.9159736Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:55.9162740Z Entering 'external/googletest'
2025-05-07T20:22:55.9224430Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:55.9227428Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9288835Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:55.9292757Z Entering 'external/json'
2025-05-07T20:22:55.9354909Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:55.9456890Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:55.9786151Z Entering 'external/asmjit'
2025-05-07T20:22:55.9819925Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.9853067Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.9885839Z Entering 'external/cutlass'
2025-05-07T20:22:55.9917667Z Entering 'external/googletest'
2025-05-07T20:22:55.9949980Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9982562Z Entering 'external/json'
2025-05-07T20:22:56.0031758Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.0350938Z Entering 'external/asmjit'
2025-05-07T20:22:56.0382459Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.0413780Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.0445302Z Entering 'external/cutlass'
2025-05-07T20:22:56.0478672Z Entering 'external/googletest'
2025-05-07T20:22:56.0510270Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.0541223Z Entering 'external/json'
2025-05-07T20:22:56.0584241Z ##[endgroup]
2025-05-07T20:22:56.0645155Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.0655244Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
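The repeated foreach invocations above rely on a small idiom: `--get-regexp` exits non-zero when the key is absent, and `git submodule foreach` aborts on the first non-zero exit, so the trailing `|| :` makes "unset this key if present" safe to run in every submodule. Isolated as a sketch:

  # Sketch: idempotently clear a config key across all submodules.
  # --get-regexp fails when the key is missing; `|| :` keeps foreach going.
  git submodule foreach --recursive sh -c \
    "git config --local --name-only --get-regexp 'core\.sshCommand' \
     && git config --local --unset-all 'core.sshCommand' || :"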
2025-05-07T20:22:56.0837094Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.0837391Z with:
2025-05-07T20:22:56.0837625Z name: fbgemm_genai_x86_gcc_py3.10_cu12.6.3.whl
2025-05-07T20:22:56.0837933Z merge-multiple: false
2025-05-07T20:22:56.0838178Z repository: pytorch/FBGEMM
2025-05-07T20:22:56.0838417Z run-id: 14891846252
2025-05-07T20:22:56.0838629Z env:
2025-05-07T20:22:56.0838851Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.0839132Z BUILD_ENV: build_binary
2025-05-07T20:22:56.0839369Z BUILD_TARGET: genai
2025-05-07T20:22:56.0839580Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.0839807Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.0840043Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.0840267Z ##[endgroup]
2025-05-07T20:22:56.3145825Z Downloading single artifact
2025-05-07T20:22:56.4118796Z Preparing to download the following artifacts:
2025-05-07T20:22:56.4119721Z - fbgemm_genai_x86_gcc_py3.10_cu12.6.3.whl (ID: 3081361682, Size: 12507040, Expected Digest: sha256:54786970e5b7d46c26833313b7eb27e7a268d8dcd818a1c2bdaca6edadbd9a0b)
2025-05-07T20:22:56.4878966Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-1bd3b0b6-3733-53f4-b996-74ebe9e5efe1/artifacts/1754a5081fdead90bf158dc66d782ebfca5c7dcf5e2261bf900fbf3d44fedad1.zip
2025-05-07T20:22:56.4880365Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.5836794Z (node:57041) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.5837735Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.8406346Z SHA256 digest of downloaded artifact is 54786970e5b7d46c26833313b7eb27e7a268d8dcd818a1c2bdaca6edadbd9a0b
2025-05-07T20:22:56.8406947Z Artifact download completed successfully.
2025-05-07T20:22:56.8407275Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.8412599Z Download artifact has finished successfully
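download-artifact@v4 compares the artifact's recorded digest (the Expected Digest above) against the bytes it fetched before declaring success. A hand-rolled equivalent, assuming the artifact archive has already been saved as artifact.zip (a hypothetical local name):

  # Sketch: verify a downloaded artifact against its expected SHA256 digest.
  EXPECTED=54786970e5b7d46c26833313b7eb27e7a268d8dcd818a1c2bdaca6edadbd9a0b
  ACTUAL=$(sha256sum artifact.zip | awk '{print $1}')
  if [ "$ACTUAL" != "$EXPECTED" ]; then
    echo "digest mismatch: got $ACTUAL" >&2
    exit 1
  fi
  echo "SHA256 digest of downloaded artifact is $ACTUAL"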
2025-05-07T20:22:56.8674323Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.8674708Z with:
2025-05-07T20:22:56.8674918Z driver-version: 570.133.07
2025-05-07T20:22:56.8675158Z env:
2025-05-07T20:22:56.8675368Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8675660Z BUILD_ENV: build_binary
2025-05-07T20:22:56.8675899Z BUILD_TARGET: genai
2025-05-07T20:22:56.8676112Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.8676343Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8676594Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8676822Z ##[endgroup]
2025-05-07T20:22:56.8770566Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.8770938Z with:
2025-05-07T20:22:56.8771323Z timeout_minutes: 10
2025-05-07T20:22:56.8771556Z max_attempts: 3
2025-05-07T20:22:56.8795168Z command:
  # Is it disgusting to have a full shell script here in this github action? Sure
  # But is it the best way to make it so that this action relies on nothing else? Absolutely
  set -eou pipefail

  DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
  DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

  install_nvidia_docker2_amzn2() {
    (
      set -x
      # Needed for yum-config-manager
      sudo yum install -y yum-utils
      if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
        YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
      else
        # Amazon Linux 2
        YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
      fi
      sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
      sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    )
  }

  install_nvidia_docker2_ubuntu20() {
    (
      set -x
      # Install nvidia-driver package if not installed
      status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
      if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
        sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      fi
    )
  }

  pre_install_nvidia_driver_amzn2() {
    (
      # Purge any nvidia driver installed from RHEL repo
      sudo yum remove -y nvidia-driver-latest-dkms
    )
  }

  install_nvidia_driver_common() {
    (
      # Try to gather more information about the runner and its existing NVIDIA driver if any
      echo "Before installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      HAS_NVIDIA_DRIVER=0
      # Check if NVIDIA driver has already been installed
      if [ -x "$(command -v nvidia-smi)" ]; then
        set +e
        # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
        # so that the same driver version is not print over multiple lines
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
        elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
          # Turn off persistent mode so that the installation script can unload the kernel module
          sudo killall nvidia-persistenced || true
        else
          HAS_NVIDIA_DRIVER=1
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
        fi
        set -e
      fi

      if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
        # CAUTION: this may need to be updated in future
        if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
          sudo yum groupinstall -y "Development Tools"
          # ensure our kernel install is the same as our underlying kernel,
          # groupinstall "Development Tools" has a habit of mismatching kernel headers
          sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
          sudo modprobe backlight
        fi
        sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

        set +e
        sudo /bin/bash /tmp/nvidia_driver -s --no-drm
        NVIDIA_INSTALLATION_STATUS=$?

        RESET_GPU=0
        if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
          sudo cat /var/log/nvidia-installer.log
          # Fail to install NVIDIA driver, try to reset the GPU
          RESET_GPU=1
        elif [ -x "$(command -v nvidia-smi)" ]; then
          # Check again if nvidia-smi works even if the driver installation completes successfully
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            RESET_GPU=1
          fi
        fi

        if [ "$RESET_GPU" -eq 1 ]; then
          NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
          # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this
          # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
          for PCI_ID in $NVIDIA_DEVICES; do
            DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
            echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)"
            # This requires sudo permission of course
            echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
            sleep 1
          done
        fi

        sudo rm -fv /tmp/nvidia_driver
        set -e
      fi
    )
  }

  post_install_nvidia_driver_common() {
    (
      sudo modprobe nvidia || true
      echo "After installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      (
        set +e
        nvidia-smi
        # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
        # the case where the driver has already crashed as it still can get the driver version
        # and some basic information like the bus ID. However, the rest of the information
        # would be missing (ERR!), for example:
        #
        # +-----------------------------------------------------------------------------+
        # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
        # |-------------------------------+----------------------+----------------------+
        # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        # |                               |                      |               MIG M. |
        # |===============================+======================+======================|
        # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
        # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
        # |                               |                      |                 ERR! |
        # +-------------------------------+----------------------+----------------------+
        #
        # +-----------------------------------------------------------------------------+
        # | Processes:                                                                  |
        # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
        # |        ID   ID                                                   Usage      |
        # |=============================================================================|
        # +-----------------------------------------------------------------------------+
        #
        # This should be reported as a failure instead as it will guarantee to fail when
        # Docker tries to run with --gpus all
        #
        # So, the correct check here is to query one of the missing piece of info like
        # GPU name, so that the command can fail accordingly
        nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
        NVIDIA_SMI_STATUS=$?

        # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
        if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
          echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
        else
          echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
          exit ${NVIDIA_SMI_STATUS}
        fi
        set -e
      )
    )
  }

  install_nvidia_driver_amzn2() {
    (
      set -x
      pre_install_nvidia_driver_amzn2
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  install_nvidia_driver_ubuntu20() {
    (
      set -x
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  echo "== Installing nvidia driver ${DRIVER_FN} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_driver_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_driver_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  # Install container toolkit based on distribution
  echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_docker2_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_docker2_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

  # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
  # more than one GPUs. This just needs to be run once. The command fails
  # on subsequent runs and complains that the mode is already on, but that's
  # ok
  sudo nvidia-persistenced || true
  # This should show persistence mode ON
  nvidia-smi
2025-05-07T20:22:56.8818495Z retry_wait_seconds: 10
2025-05-07T20:22:56.8818757Z polling_interval_seconds: 1
2025-05-07T20:22:56.8819010Z warning_on_retry: true
2025-05-07T20:22:56.8819251Z continue_on_error: false
2025-05-07T20:22:56.8819486Z env:
2025-05-07T20:22:56.8819708Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8820091Z BUILD_ENV: build_binary
2025-05-07T20:22:56.8820332Z BUILD_TARGET: genai
2025-05-07T20:22:56.8820555Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.8820798Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8821052Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8821292Z DRIVER_VERSION: 570.133.07
2025-05-07T20:22:56.8821534Z ##[endgroup]
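nick-fields/retry wraps the script above with timeout_minutes: 10, max_attempts: 3, and retry_wait_seconds: 10. A rough plain-bash sketch of those semantics, with setup_nvidia.sh standing in for the command body (a hypothetical file name):

  # Sketch: retry loop approximating the action's settings above.
  for attempt in 1 2 3; do
    if timeout 10m bash setup_nvidia.sh; then
      echo "Command completed after ${attempt} attempt(s)."
      break
    fi
    if [ "$attempt" -eq 3 ]; then
      echo "Final attempt failed" >&2
      exit 1
    fi
    echo "Attempt ${attempt} failed. Retrying in 10 seconds..." >&2
    sleep 10
  done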
2025-05-07T20:22:56.9628677Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:56.9629568Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:56.9632413Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.5030352Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.5030789Z No packages marked for removal.
2025-05-07T20:22:57.5094541Z Dependencies resolved.
2025-05-07T20:22:57.5104150Z Nothing to do.
2025-05-07T20:22:57.5104401Z Complete!
2025-05-07T20:22:57.5436092Z + install_nvidia_driver_common
2025-05-07T20:22:57.5441601Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.5441896Z + lspci
2025-05-07T20:22:57.5443581Z Before installing NVIDIA driver
2025-05-07T20:22:57.5625385Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.5629068Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.5642469Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.5643325Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.5644103Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.5644953Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.5645741Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.5646544Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.5647198Z + lsmod
2025-05-07T20:22:57.5673782Z Module                  Size  Used by
2025-05-07T20:22:57.5674306Z xt_conntrack           16384  1
2025-05-07T20:22:57.5674735Z nft_chain_nat          16384  3
2025-05-07T20:22:57.5675152Z xt_MASQUERADE          20480  1
2025-05-07T20:22:57.5675647Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.5676150Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:57.5676817Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.5677558Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:57.5678072Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:57.5678519Z xfrm_user              57344  1
2025-05-07T20:22:57.5678953Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:57.5679410Z xt_addrtype            16384  2
2025-05-07T20:22:57.5679812Z nft_compat             20480  4
2025-05-07T20:22:57.5680258Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.5680899Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.5681504Z br_netfilter           36864  0
2025-05-07T20:22:57.5681954Z bridge                323584  1 br_netfilter
2025-05-07T20:22:57.5682421Z stp                    16384  1 bridge
2025-05-07T20:22:57.5682859Z llc                    16384  2 bridge,stp
2025-05-07T20:22:57.5683283Z overlay               167936  0
2025-05-07T20:22:57.5683663Z tls                   135168  0
2025-05-07T20:22:57.5684046Z nls_ascii              16384  1
2025-05-07T20:22:57.5684458Z nls_cp437              20480  1
2025-05-07T20:22:57.5684867Z vfat                   24576  1
2025-05-07T20:22:57.5685255Z fat                    86016  1 vfat
2025-05-07T20:22:57.5685681Z sunrpc                696320  1
2025-05-07T20:22:57.5686070Z ena                   180224  0
2025-05-07T20:22:57.5686452Z i8042                  45056  0
2025-05-07T20:22:57.5686858Z serio                  28672  3 i8042
2025-05-07T20:22:57.5687299Z button                 24576  0
2025-05-07T20:22:57.5687726Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:57.5688169Z dm_mod                188416  0
2025-05-07T20:22:57.5688583Z sch_fq_codel           20480  17
2025-05-07T20:22:57.5689001Z fuse                  163840  1
2025-05-07T20:22:57.5689392Z loop                   36864  0
2025-05-07T20:22:57.5690134Z configfs               57344  1
2025-05-07T20:22:57.5690576Z dax                    45056  1 dm_mod
2025-05-07T20:22:57.5691010Z dmi_sysfs              20480  0
2025-05-07T20:22:57.5691428Z crc32_pclmul           16384  0
2025-05-07T20:22:57.5691852Z crc32c_intel           24576  0
2025-05-07T20:22:57.5692273Z efivarfs               24576  1
2025-05-07T20:22:57.5692731Z + modinfo nvidia
2025-05-07T20:22:57.5693350Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.5694079Z import_ns:      DMA_BUF
2025-05-07T20:22:57.5694464Z alias:          char-major-195-*
2025-05-07T20:22:57.5694891Z version:        570.133.07
2025-05-07T20:22:57.5695299Z supported:      external
2025-05-07T20:22:57.5695686Z license:        Dual MIT/GPL
2025-05-07T20:22:57.5696160Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.5696677Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.5697510Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:57.5698039Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.5698641Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.5699161Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.5699643Z depends:        i2c-core,drm
2025-05-07T20:22:57.5700175Z retpoline:      Y
2025-05-07T20:22:57.5700517Z name:           nvidia
2025-05-07T20:22:57.5701078Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.5701823Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.5702549Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.5703476Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.5704071Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:57.5704583Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.5705069Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:57.5705572Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:57.5706245Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:57.5706812Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.5707368Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.5707707Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.5708010Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:57.5708310Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.5708670Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.5709059Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.5709425Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.5709848Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.5710253Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.5710670Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.5711090Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.5711429Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.5711798Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.5712160Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.5712506Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.5712827Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.5713179Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.5713504Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.5713813Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:57.5714149Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.5714511Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.5714838Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:57.5715189Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.5715528Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.5715876Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:57.5716215Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.5716536Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:57.5716830Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.5717154Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.5717471Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.5717781Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.5718108Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.5718457Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.5718792Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:57.5719118Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.5719462Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.5719785Z parm:           rm_firmware_active:charp
2025-05-07T20:22:57.5720221Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.5720469Z ++ command -v nvidia-smi
2025-05-07T20:22:57.5720722Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.5720979Z + set +e
2025-05-07T20:22:57.5721286Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.3673819Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.3674156Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.3674710Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.3674936Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.3675208Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.3675647Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.3676110Z + set -e
2025-05-07T20:22:59.3677132Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.3677528Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.3677999Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.3680994Z + sudo modprobe nvidia
2025-05-07T20:22:59.5183454Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.5183879Z + lspci
2025-05-07T20:22:59.5184176Z After installing NVIDIA driver
2025-05-07T20:22:59.5300476Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.5301173Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.5301819Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.5302335Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.5302816Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.5303537Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.5304226Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.5304707Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.5305112Z + lsmod
2025-05-07T20:22:59.5331939Z Module                  Size  Used by
2025-05-07T20:22:59.5332372Z nvidia_uvm           1884160  0
2025-05-07T20:22:59.5332784Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:22:59.5333178Z drm                   602112  1 nvidia
2025-05-07T20:22:59.5333576Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:22:59.5333890Z backlight              24576  1 drm
2025-05-07T20:22:59.5334181Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:22:59.5334580Z xt_conntrack           16384  1
2025-05-07T20:22:59.5334948Z nft_chain_nat          16384  3
2025-05-07T20:22:59.5335312Z xt_MASQUERADE          20480  1
2025-05-07T20:22:59.5335629Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.5335991Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:59.5336388Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.5336827Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:59.5337142Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:59.5337434Z xfrm_user              57344  1
2025-05-07T20:22:59.5337703Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:59.5337987Z xt_addrtype            16384  2
2025-05-07T20:22:59.5338264Z nft_compat             20480  4
2025-05-07T20:22:59.5338562Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.5338971Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.5339346Z br_netfilter           36864  0
2025-05-07T20:22:59.5339617Z bridge                323584  1 br_netfilter
2025-05-07T20:22:59.5340026Z stp                    16384  1 bridge
2025-05-07T20:22:59.5340313Z llc                    16384  2 bridge,stp
2025-05-07T20:22:59.5340594Z overlay               167936  0
2025-05-07T20:22:59.5340852Z tls                   135168  0
2025-05-07T20:22:59.5341103Z nls_ascii              16384  1
2025-05-07T20:22:59.5341621Z nls_cp437              20480  1
2025-05-07T20:22:59.5341872Z vfat                   24576  1
2025-05-07T20:22:59.5342126Z fat                    86016  1 vfat
2025-05-07T20:22:59.5342391Z sunrpc                696320  1
2025-05-07T20:22:59.5342631Z ena                   180224  0
2025-05-07T20:22:59.5342876Z i8042                  45056  0
2025-05-07T20:22:59.5343128Z serio                  28672  3 i8042
2025-05-07T20:22:59.5343391Z button                 24576  0
2025-05-07T20:22:59.5343647Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:59.5343905Z dm_mod                188416  0
2025-05-07T20:22:59.5344153Z sch_fq_codel           20480  17
2025-05-07T20:22:59.5344411Z fuse                  163840  1
2025-05-07T20:22:59.5344657Z loop                   36864  0
2025-05-07T20:22:59.5345078Z configfs               57344  1
2025-05-07T20:22:59.5345335Z dax                    45056  1 dm_mod
2025-05-07T20:22:59.5345609Z dmi_sysfs              20480  0
2025-05-07T20:22:59.5345860Z crc32_pclmul           16384  0
2025-05-07T20:22:59.5346118Z crc32c_intel           24576  0
2025-05-07T20:22:59.5346372Z efivarfs               24576  1
2025-05-07T20:22:59.5346620Z + modinfo nvidia
2025-05-07T20:22:59.5349581Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.5350241Z import_ns:      DMA_BUF
2025-05-07T20:22:59.5350595Z alias:          char-major-195-*
2025-05-07T20:22:59.5350959Z version:        570.133.07
2025-05-07T20:22:59.5351297Z supported:      external
2025-05-07T20:22:59.5351556Z license:        Dual MIT/GPL
2025-05-07T20:22:59.5351841Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.5352170Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.5352490Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:59.5352815Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.5353144Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.5353478Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.5353789Z depends:        i2c-core,drm
2025-05-07T20:22:59.5354050Z retpoline:      Y
2025-05-07T20:22:59.5354358Z name:           nvidia
2025-05-07T20:22:59.5354844Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.5355434Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.5355868Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.5356277Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.5356584Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:59.5356877Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.5357188Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:59.5357485Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:59.5357784Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:59.5358140Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.5358528Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.5358857Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.5359197Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:59.5359500Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.5359856Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.5360243Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.5360618Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.5361023Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.5361423Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.5361837Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.5362237Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.5362575Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.5362933Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.5363428Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.5363805Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.5364151Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.5364516Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.5364872Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.5365209Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:59.5365594Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.5365995Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.5366349Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:59.5366739Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.5367082Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.5367506Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:59.5367835Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.5368161Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:59.5368450Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.5368763Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.5369084Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.5369398Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.5369722Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.5370065Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.5370405Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:59.5370733Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.5371070Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.5371404Z parm:           rm_firmware_active:charp
2025-05-07T20:22:59.5371677Z + set +e
2025-05-07T20:22:59.5371869Z + nvidia-smi
2025-05-07T20:23:00.9249800Z Wed May 7 20:23:00 2025
2025-05-07T20:23:00.9250192Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9250702Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8      |
2025-05-07T20:23:00.9251180Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9251654Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:00.9252172Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:00.9252594Z |                                         |                        |               MIG M. |
2025-05-07T20:23:00.9252927Z |=========================================+========================+======================|
2025-05-07T20:23:00.9313630Z |   0  NVIDIA A10G                   Off  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:00.9314091Z |  0%   31C    P0             63W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:00.9314473Z |                                         |                        |                  N/A |
2025-05-07T20:23:00.9314864Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9315267Z
2025-05-07T20:23:00.9315657Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9316075Z | Processes:                                                                              |
2025-05-07T20:23:00.9316507Z |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
2025-05-07T20:23:00.9316906Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:00.9317249Z |=========================================================================================|
2025-05-07T20:23:00.9318840Z |  No running processes found                                                             |
2025-05-07T20:23:00.9320116Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.3508248Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.7367488Z NVIDIA A10G
2025-05-07T20:23:03.0051507Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.0051897Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.0052169Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.0052453Z + set -e
2025-05-07T20:23:03.0052666Z INFO: Ignoring allowed status 0
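The check that just ran treats nvidia-smi exit codes 0 and 14 as healthy; anything else is surfaced as a failure (see the gpu-operator issue linked in the script for the rationale behind 14). The decision point, isolated as a sketch:

  # Sketch: accept only nvidia-smi exit statuses 0 and 14, fail on anything else.
  nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
  status=$?
  case "$status" in
    0|14) echo "INFO: Ignoring allowed status ${status}" ;;
    *)    echo "ERROR: nvidia-smi exited with unresolved status ${status}" >&2
          exit "$status" ;;
  esac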
2025-05-07T20:23:03.0061119Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.0064802Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.4237641Z Last metadata expiration check: 0:05:51 ago on Wed May 7 20:17:12 2025.
2025-05-07T20:23:03.4482681Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.4878138Z Dependencies resolved.
2025-05-07T20:23:03.5060270Z Nothing to do.
2025-05-07T20:23:03.5060620Z Complete!
2025-05-07T20:23:03.5440794Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.5441612Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.5442589Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.8672226Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.9243714Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.4486912Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:04.4734485Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.5137764Z Dependencies resolved.
2025-05-07T20:23:04.5318079Z ================================================================================
2025-05-07T20:23:04.5318498Z  Package                        Arch    Version   Repository               Size
2025-05-07T20:23:04.5318899Z ================================================================================
2025-05-07T20:23:04.5319203Z Downgrading:
2025-05-07T20:23:04.5319559Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.5320139Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.5320493Z
2025-05-07T20:23:04.5320588Z Transaction Summary
2025-05-07T20:23:04.5320835Z ================================================================================
2025-05-07T20:23:04.5321134Z Downgrade  2 Packages
2025-05-07T20:23:04.5321286Z
2025-05-07T20:23:04.5321404Z Total download size: 6.8 M
2025-05-07T20:23:04.5322718Z Downloading Packages:
2025-05-07T20:23:04.5818539Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  26 MB/s | 1.2 MB     00:00
2025-05-07T20:23:04.6212396Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  64 MB/s | 5.6 MB     00:00
2025-05-07T20:23:04.6221406Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.6224791Z Total                                            76 MB/s | 6.8 MB     00:00
2025-05-07T20:23:04.6227756Z Running transaction check
2025-05-07T20:23:04.6332050Z Transaction check succeeded.
2025-05-07T20:23:04.6332688Z Running transaction test
2025-05-07T20:23:04.6626386Z Transaction test succeeded.
2025-05-07T20:23:04.6630242Z Running transaction
2025-05-07T20:23:05.2080547Z   Preparing        :                                                    1/1
2025-05-07T20:23:05.3130864Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64      1/4
2025-05-07T20:23:05.3154954Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64           2/4
2025-05-07T20:23:05.3371580Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64           2/4
2025-05-07T20:23:05.3372351Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64           3/4
2025-05-07T20:23:05.3475354Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64           3/4
2025-05-07T20:23:05.3498422Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64      4/4
2025-05-07T20:23:06.7631672Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64           4/4
2025-05-07T20:23:06.7632479Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64           1/4
2025-05-07T20:23:06.7633212Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64           2/4
2025-05-07T20:23:06.7633845Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64      3/4
2025-05-07T20:23:06.8987044Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64      4/4
================================================================================
2025-05-07T20:23:06.8988952Z WARNING:
2025-05-07T20:23:06.8989432Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:06.8990427Z
2025-05-07T20:23:06.8990616Z   Available Versions:
2025-05-07T20:23:06.8990922Z
2025-05-07T20:23:06.8991118Z   Version 2023.7.20250331:
2025-05-07T20:23:06.8991520Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:06.8991793Z
2025-05-07T20:23:06.8991914Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:06.8992128Z
2025-05-07T20:23:06.8992214Z     Release notes:
2025-05-07T20:23:06.8992626Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:06.8992997Z
2025-05-07T20:23:06.8993097Z   Version 2023.7.20250414:
2025-05-07T20:23:06.8993398Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:06.8993700Z
2025-05-07T20:23:06.8993850Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:06.8994092Z
2025-05-07T20:23:06.8994252Z     Release notes:
2025-05-07T20:23:06.8994912Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:06.8995339Z
2025-05-07T20:23:06.8995467Z   Version 2023.7.20250428:
2025-05-07T20:23:06.8995856Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:06.8996190Z
2025-05-07T20:23:06.8996377Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:06.8996651Z
2025-05-07T20:23:06.8996767Z     Release notes:
2025-05-07T20:23:06.8997234Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:06.9008849Z
2025-05-07T20:23:06.9008974Z ================================================================================
2025-05-07T20:23:06.9353767Z
2025-05-07T20:23:06.9354133Z
2025-05-07T20:23:06.9354288Z Downgraded:
2025-05-07T20:23:06.9354781Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:06.9355542Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:06.9356008Z
2025-05-07T20:23:06.9356112Z Complete!
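Requesting nvidia-container-toolkit-1.16.2 while 1.17.6 is installed makes dnf plan the downgrade transaction shown above. A sketch that makes the pin explicit and only touches the package on a version mismatch (WANT mirrors this job's pin):

  # Sketch: pin nvidia-container-toolkit to a specific version, downgrading if needed.
  WANT=1.16.2
  HAVE=$(rpm -q --qf '%{VERSION}' nvidia-container-toolkit 2>/dev/null || echo none)
  if [ "$HAVE" != "$WANT" ]; then
    sudo yum install -y "nvidia-container-toolkit-${WANT}"
    sudo systemctl restart docker
  fi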
2025-05-07T20:23:06.9835544Z + sudo systemctl restart docker
2025-05-07T20:23:10.8644770Z Wed May 7 20:23:10 2025
2025-05-07T20:23:10.8645339Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.8645879Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8      |
2025-05-07T20:23:10.8646374Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.8646867Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:10.8647377Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:10.8647805Z |                                         |                        |               MIG M. |
2025-05-07T20:23:10.8648140Z |=========================================+========================+======================|
2025-05-07T20:23:10.8728819Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:10.8730595Z |  0%   31C    P0             63W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:10.8731362Z |                                         |                        |                  N/A |
2025-05-07T20:23:10.8732122Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:10.8732924Z
2025-05-07T20:23:10.8733462Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:10.8734003Z | Processes:                                                                              |
2025-05-07T20:23:10.8734440Z |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
2025-05-07T20:23:10.8735030Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:10.8735376Z |=========================================================================================|
2025-05-07T20:23:10.8735805Z |  No running processes found                                                             |
2025-05-07T20:23:10.8736274Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.9383031Z Command completed after 1 attempt(s).
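With the driver verified and persistence mode now showing "On", the setup script's last effect was exporting GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all into GITHUB_ENV (visible in the env block below). A sketch of how a later container-based step might consume it; the image tag is illustrative only:

  # Sketch: use the exported GPU_FLAG when launching a CUDA container.
  # GPU_FLAG is deliberately left unquoted so it splits into separate arguments.
  docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi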
2025-05-07T20:23:11.9471814Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.9472279Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.9486912Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:11.9487262Z env:
2025-05-07T20:23:11.9487488Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:11.9487791Z BUILD_ENV: build_binary
2025-05-07T20:23:11.9488035Z BUILD_TARGET: genai
2025-05-07T20:23:11.9488273Z BUILD_VARIANT: cuda
2025-05-07T20:23:11.9488503Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:11.9488756Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:11.9489057Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:11.9489381Z ##[endgroup]
2025-05-07T20:23:12.2837955Z ################################################################################
2025-05-07T20:23:12.2838315Z # Print System Info
2025-05-07T20:23:12.2838541Z #
2025-05-07T20:23:12.2853119Z # [2025-05-07T20:23:12.284Z] + print_system_info
2025-05-07T20:23:12.2853480Z ################################################################################
2025-05-07T20:23:12.2853692Z
2025-05-07T20:23:12.2853806Z ################################################################################
2025-05-07T20:23:12.2854137Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.2854436Z + printenv
2025-05-07T20:23:12.2854550Z
2025-05-07T20:23:12.2877204Z SHELL=/bin/bash
2025-05-07T20:23:12.2877699Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.2878129Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.2878658Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2879269Z GITHUB_ACTION=__run
2025-05-07T20:23:12.2879560Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.2879898Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.2880147Z RUNNER_NAME=i-0c2643f2bcfaf5e6b
2025-05-07T20:23:12.2880436Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.2880732Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.2881000Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.2881366Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.2881784Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.2882062Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.2882356Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.2882998Z ***
2025-05-07T20:23:12.2883195Z LOGNAME=ec2-user
2025-05-07T20:23:12.2883443Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.2883708Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.2883934Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.2884160Z SYSTEMD_EXEC_PID=55541
2025-05-07T20:23:12.2884444Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.2884980Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.2885487Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.2885773Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.2886025Z RUNNER_OS=Linux
2025-05-07T20:23:12.2886248Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.2886491Z HOME=/home/ec2-user
2025-05-07T20:23:12.2886735Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.2887023Z LANG=C.UTF-8
2025-05-07T20:23:12.2887318Z RUNNER_TRACKING_ID=github_aa71c52e-5c10-4a56-a421-f206faa9b39e
2025-05-07T20:23:12.2887671Z RUNNER_ARCH=X64
2025-05-07T20:23:12.2887937Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.2888537Z BUILD_TARGET=genai
2025-05-07T20:23:12.2889068Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2890298Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2891036Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.2891730Z INVOCATION_ID=dad162b31b1f499cb44ecb48a70cac1d
2025-05-07T20:23:12.2892064Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.2892318Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.2892892Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2893504Z BUILD_ENV=build_binary
2025-05-07T20:23:12.2893726Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.2893943Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.2894169Z KERN_NAME_LC=linux
2025-05-07T20:23:12.2894393Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.2894690Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.2895030Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.2895267Z USER=ec2-user
2025-05-07T20:23:12.2895501Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.2895780Z SHLVL=1
2025-05-07T20:23:12.2895971Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:12.2896284Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:12.2896740Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:12.2897090Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:12.2897326Z KERN_NAME=Linux
2025-05-07T20:23:12.2897551Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:12.2897944Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:12.2898366Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:12.2898640Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:12.2898875Z JOURNAL_STREAM=8:81829
2025-05-07T20:23:12.2899192Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:12.2899556Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:12.2899985Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:12.2900320Z GITHUB_BASE_REF=main
2025-05-07T20:23:12.2900539Z CI=true
2025-05-07T20:23:12.2900752Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:12.2901024Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:12.2901296Z GITHUB_ACTION_REF=
2025-05-07T20:23:12.2901541Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:12.2902135Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_4a7d9ed4-eb36-47db-9632-4aa240a026c5
2025-05-07T20:23:12.2902711Z MACHINE_NAME=x86_64
2025-05-07T20:23:12.2902928Z _=/usr/bin/printenv
2025-05-07T20:23:12.2903059Z 
2025-05-07T20:23:12.2903192Z ################################################################################
2025-05-07T20:23:12.2903514Z [INFO] Print ldd version ...
2025-05-07T20:23:12.2903773Z + ldd --version
2025-05-07T20:23:12.2903900Z 
2025-05-07T20:23:12.2903992Z ldd (GNU libc) 2.34
2025-05-07T20:23:12.2904255Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:12.2904692Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:23:12.2905217Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:12.2905655Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:12.2905877Z 
2025-05-07T20:23:12.2905990Z ################################################################################
2025-05-07T20:23:12.2906298Z [INFO] Print CPU info ...
2025-05-07T20:23:12.2906534Z + nproc
2025-05-07T20:23:12.2906641Z 
2025-05-07T20:23:12.2920148Z 16
2025-05-07T20:23:12.2921884Z 
2025-05-07T20:23:12.2922540Z + lscpu
2025-05-07T20:23:12.2922701Z 
2025-05-07T20:23:12.3035071Z Architecture:                         x86_64
2025-05-07T20:23:12.3035482Z CPU op-mode(s):                       32-bit, 64-bit
2025-05-07T20:23:12.3036296Z Address sizes:                        48 bits physical, 48 bits virtual
2025-05-07T20:23:12.3036697Z Byte Order:                           Little Endian
2025-05-07T20:23:12.3037008Z CPU(s):                               16
2025-05-07T20:23:12.3037311Z On-line CPU(s) list:                  0-15
2025-05-07T20:23:12.3037639Z Vendor ID:                            AuthenticAMD
2025-05-07T20:23:12.3037988Z Model name:                           AMD EPYC 7R32
2025-05-07T20:23:12.3038301Z CPU family:                           23
2025-05-07T20:23:12.3038925Z Model:                                49
2025-05-07T20:23:12.3039255Z Thread(s) per core:                   2
2025-05-07T20:23:12.3039560Z Core(s) per socket:                   8
2025-05-07T20:23:12.3039851Z Socket(s):                            1
2025-05-07T20:23:12.3040140Z Stepping:                             0
2025-05-07T20:23:12.3040449Z BogoMIPS:                             5599.62
2025-05-07T20:23:12.3042507Z Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:12.3044614Z Hypervisor vendor:                    KVM
2025-05-07T20:23:12.3044917Z Virtualization type:                  full
2025-05-07T20:23:12.3045257Z L1d cache:                            256 KiB (8 instances)
2025-05-07T20:23:12.3045620Z L1i cache:                            256 KiB (8 instances)
2025-05-07T20:23:12.3045973Z L2 cache:                             4 MiB (8 instances)
2025-05-07T20:23:12.3046324Z L3 cache:                             32 MiB (2 instances)
2025-05-07T20:23:12.3046653Z NUMA node(s):                         1
2025-05-07T20:23:12.3046939Z NUMA node0 CPU(s):                    0-15
2025-05-07T20:23:12.3047272Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:23:12.3047643Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:23:12.3048002Z Vulnerability L1tf:                   Not affected
2025-05-07T20:23:12.3048347Z Vulnerability Mds:                    Not affected
2025-05-07T20:23:12.3048705Z Vulnerability Meltdown:               Not affected
2025-05-07T20:23:12.3049062Z Vulnerability Mmio stale data:        Not affected
2025-05-07T20:23:12.3049418Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:23:12.3049967Z Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:23:12.3050563Z Vulnerability Spec rstack overflow:   Mitigation; safe RET
2025-05-07T20:23:12.3051116Z Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:23:12.3051829Z Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:23:12.3052725Z Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:23:12.3053414Z Vulnerability Srbds:                  Not affected
2025-05-07T20:23:12.3053781Z Vulnerability Tsx async abort:        Not affected
2025-05-07T20:23:12.3054011Z 
2025-05-07T20:23:12.3054112Z + cat /proc/cpuinfo
2025-05-07T20:23:12.3054249Z 
2025-05-07T20:23:12.3054421Z processor : 0
2025-05-07T20:23:12.3054636Z vendor_id : AuthenticAMD
2025-05-07T20:23:12.3054891Z cpu family : 23
2025-05-07T20:23:12.3055107Z model : 49
2025-05-07T20:23:12.3055318Z model name : AMD EPYC 7R32
2025-05-07T20:23:12.3055573Z stepping : 0
2025-05-07T20:23:12.3055790Z microcode : 0x830107f
2025-05-07T20:23:12.3056104Z cpu MHz : 2862.187
2025-05-07T20:23:12.3056321Z cache size : 512 KB
2025-05-07T20:23:12.3056535Z physical id : 0
2025-05-07T20:23:12.3056740Z siblings : 16
2025-05-07T20:23:12.3056942Z core id : 0
2025-05-07T20:23:12.3057146Z cpu cores : 8
2025-05-07T20:23:12.3057342Z apicid : 0
2025-05-07T20:23:12.3057545Z initial apicid : 0
2025-05-07T20:23:12.3057756Z fpu : yes
2025-05-07T20:23:12.3057949Z fpu_exception : yes
2025-05-07T20:23:12.3058165Z cpuid level : 13
2025-05-07T20:23:12.3058378Z wp : yes
2025-05-07T20:23:12.3060486Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:12.3062712Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:12.3063192Z bogomips : 5599.62
2025-05-07T20:23:12.3063415Z TLB size : 3072 4K pages
2025-05-07T20:23:12.3063652Z clflush size : 64
2025-05-07T20:23:12.3063864Z cache_alignment : 64
2025-05-07T20:23:12.3064135Z address sizes : 48 bits physical, 48 bits virtual
2025-05-07T20:23:12.3064456Z power management:
2025-05-07T20:23:12.3064588Z 
[ /proc/cpuinfo records for processors 1-15 omitted: identical to processor 0 except cpu MHz, core id, apicid, and initial apicid ]
2025-05-07T20:23:12.3229881Z 
2025-05-07T20:23:12.3229885Z 
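The probes above share one pattern: an [INFO] banner, then the command echoed with a leading "+", a blank line, and its output. A minimal bash sketch of the wrapper this implies, under assumed names (print_exec is hypothetical and not confirmed by the log; print_system_info is the function named in the step banner):

    print_exec () {
      # Echo the command in the "+ cmd" style used throughout this log,
      # then run it, separating the output with blank lines.
      echo "+ $*"
      echo ""
      "$@"
      echo ""
    }

    print_system_info () {
      echo "[INFO] Printing environment variables ..."
      print_exec printenv
      echo "[INFO] Print ldd version ..."
      print_exec ldd --version
      echo "[INFO] Print CPU info ..."
      print_exec nproc
      print_exec lscpu
      print_exec cat /proc/cpuinfo
    }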
2025-05-07T20:23:12.3230013Z ################################################################################
2025-05-07T20:23:12.3230322Z [INFO] Print PCI info ...
2025-05-07T20:23:12.3230578Z + lspci -v
2025-05-07T20:23:12.3230694Z 
2025-05-07T20:23:12.3230908Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:12.3231292Z 	Subsystem: Amazon.com, Inc. Device 1237
2025-05-07T20:23:12.3231610Z 	Flags: bus master, medium devsel, latency 0
2025-05-07T20:23:12.3231825Z 
2025-05-07T20:23:12.3232026Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:12.3232410Z 	Physical Slot: 1
2025-05-07T20:23:12.3232666Z 	Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:12.3232892Z 
2025-05-07T20:23:12.3233164Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:12.3233597Z 	Physical Slot: 1
2025-05-07T20:23:12.3233858Z 	Flags: bus master, fast devsel, latency 0, IRQ 9
2025-05-07T20:23:12.3234082Z 
2025-05-07T20:23:12.3234355Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller])
2025-05-07T20:23:12.3234798Z 	Physical Slot: 3
2025-05-07T20:23:12.3235047Z 	Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:12.3235392Z 	Memory at c1000000 (32-bit, prefetchable) [size=4M]
2025-05-07T20:23:12.3235745Z 	Expansion ROM at 000c0000 [disabled] [size=128K]
2025-05-07T20:23:12.3235974Z 
2025-05-07T20:23:12.3236276Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express])
2025-05-07T20:23:12.3236874Z 	Subsystem: Amazon.com, Inc. Device 0000
2025-05-07T20:23:12.3237162Z 	Physical Slot: 4
2025-05-07T20:23:12.3237429Z 	Flags: bus master, fast devsel, latency 0, IRQ 11
2025-05-07T20:23:12.3237813Z 	Memory at c1808000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:12.3238170Z 	Capabilities:
2025-05-07T20:23:12.3238439Z 	Kernel driver in use: nvme
2025-05-07T20:23:12.3238610Z 
2025-05-07T20:23:12.3238911Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:12.3239396Z 	Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:12.3239739Z 	Physical Slot: 5
2025-05-07T20:23:12.3239992Z 	Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:12.3240353Z 	Memory at c1804000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:12.3240732Z 	Memory at c1400000 (32-bit, prefetchable) [size=4M]
2025-05-07T20:23:12.3241066Z 	Capabilities:
2025-05-07T20:23:12.3241350Z 	Kernel driver in use: ena
2025-05-07T20:23:12.3241603Z 	Kernel modules: ena
2025-05-07T20:23:12.3241745Z 
2025-05-07T20:23:12.3241918Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:12.3242306Z 	Subsystem: NVIDIA Corporation Device 152f
2025-05-07T20:23:12.3242601Z 	Physical Slot: 30
2025-05-07T20:23:12.3242879Z 	Flags: bus master, fast devsel, latency 0, IRQ 10
2025-05-07T20:23:12.3243295Z 	Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
2025-05-07T20:23:12.3243696Z 	Memory at 1800000000 (64-bit, prefetchable) [size=32G]
2025-05-07T20:23:12.3244069Z 	Memory at 1040000000 (64-bit, prefetchable) [size=32M]
2025-05-07T20:23:12.3244410Z 	Capabilities:
2025-05-07T20:23:12.3244689Z 	Kernel driver in use: nvidia
2025-05-07T20:23:12.3244953Z 	Kernel modules: nvidia
2025-05-07T20:23:12.3245097Z 
2025-05-07T20:23:12.3245400Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express])
2025-05-07T20:23:12.3245919Z 	Subsystem: Amazon.com, Inc. Device 0000
2025-05-07T20:23:12.3246213Z 	Physical Slot: 31
2025-05-07T20:23:12.3246459Z 	Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:12.3246820Z 	Memory at c1800000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:12.3247207Z 	Memory at c180c000 (32-bit, prefetchable) [size=8K]
2025-05-07T20:23:12.3247538Z 	Capabilities:
2025-05-07T20:23:12.3247813Z 	Kernel driver in use: nvme
2025-05-07T20:23:12.3247982Z 
2025-05-07T20:23:12.3247986Z 
2025-05-07T20:23:12.3248107Z ################################################################################
2025-05-07T20:23:12.3248437Z [INFO] Print Linux distribution info ...
2025-05-07T20:23:12.3248722Z + uname -a
2025-05-07T20:23:12.3248847Z 
2025-05-07T20:23:12.3249252Z Linux ip-10-0-1-116.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
2025-05-07T20:23:12.3249737Z 
2025-05-07T20:23:12.3249824Z + uname -m
2025-05-07T20:23:12.3249945Z 
2025-05-07T20:23:12.3250023Z x86_64
2025-05-07T20:23:12.3250129Z 
2025-05-07T20:23:12.3250215Z + cat /proc/version
2025-05-07T20:23:12.3250353Z 
2025-05-07T20:23:12.3250886Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025
2025-05-07T20:23:12.3251495Z 
2025-05-07T20:23:12.3251586Z + cat /etc/os-release
2025-05-07T20:23:12.3251727Z 
2025-05-07T20:23:12.3251839Z NAME="Amazon Linux"
2025-05-07T20:23:12.3252051Z VERSION="2023"
2025-05-07T20:23:12.3252257Z ID="amzn"
2025-05-07T20:23:12.3252445Z ID_LIKE="fedora"
2025-05-07T20:23:12.3252654Z VERSION_ID="2023"
2025-05-07T20:23:12.3252891Z PLATFORM_ID="platform:al2023"
2025-05-07T20:23:12.3253179Z PRETTY_NAME="Amazon Linux 2023.6.20250317"
2025-05-07T20:23:12.3260237Z ANSI_COLOR="0;33"
2025-05-07T20:23:12.3260541Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
2025-05-07T20:23:12.3261067Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
2025-05-07T20:23:12.3261515Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
2025-05-07T20:23:12.3261944Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
2025-05-07T20:23:12.3262392Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
2025-05-07T20:23:12.3262767Z VENDOR_NAME="AWS"
2025-05-07T20:23:12.3263015Z VENDOR_URL="https://aws.amazon.com/"
2025-05-07T20:23:12.3263303Z SUPPORT_END="2029-06-30"
2025-05-07T20:23:12.3263463Z 
2025-05-07T20:23:12.3263703Z ################################################################################
2025-05-07T20:23:12.3264020Z # Print EC2 Instance Info
2025-05-07T20:23:12.3264264Z #
2025-05-07T20:23:12.3264486Z # [2025-05-07T20:23:12.325Z] + print_ec2_info
2025-05-07T20:23:12.3264807Z ################################################################################
2025-05-07T20:23:12.3265021Z 
2025-05-07T20:23:12.3386721Z ami-id: ami-071226ecf16aa7d96
2025-05-07T20:23:12.3496321Z instance-id: i-0c2643f2bcfaf5e6b
2025-05-07T20:23:12.3603618Z instance-type: g5.4xlarge
2025-05-07T20:23:12.3646486Z ##[group]Run . $PRELUDE; print_gpu_info
2025-05-07T20:23:12.3646848Z . $PRELUDE; print_gpu_info
2025-05-07T20:23:12.3657356Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:12.3657702Z env:
2025-05-07T20:23:12.3657923Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:12.3658228Z   BUILD_ENV: build_binary
2025-05-07T20:23:12.3658480Z   BUILD_TARGET: genai
2025-05-07T20:23:12.3658709Z   BUILD_VARIANT: cuda
2025-05-07T20:23:12.3658953Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:12.3659219Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:12.3659523Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.3659962Z ##[endgroup]
2025-05-07T20:23:12.7035521Z ################################################################################
2025-05-07T20:23:12.7035939Z [INFO] Printing general display info ...
2025-05-07T20:23:12.7065696Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:12.8230980Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:12.8239706Z /usr/bin/sudo
2025-05-07T20:23:12.8250740Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:12.8261684Z /usr/bin/yum
2025-05-07T20:23:12.8263312Z [INSTALL] Updating system repositories ...
2025-05-07T20:23:12.8283577Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y
2025-05-07T20:23:13.2661468Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025.
2025-05-07T20:23:13.3404754Z ================================================================================
2025-05-07T20:23:13.3405083Z WARNING:
2025-05-07T20:23:13.3405340Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:13.3405572Z 
2025-05-07T20:23:13.3405671Z   Available Versions:
2025-05-07T20:23:13.3405817Z 
2025-05-07T20:23:13.3405930Z   Version 2023.7.20250331:
2025-05-07T20:23:13.3406233Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:13.3406512Z 
2025-05-07T20:23:13.3406644Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:13.3406847Z 
2025-05-07T20:23:13.3406935Z     Release notes:
2025-05-07T20:23:13.3407332Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:13.3407702Z 
2025-05-07T20:23:13.3407792Z   Version 2023.7.20250414:
2025-05-07T20:23:13.3408099Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:13.3408341Z 
2025-05-07T20:23:13.3408461Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:13.3408668Z 
2025-05-07T20:23:13.3408752Z     Release notes:
2025-05-07T20:23:13.3409141Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:13.3409501Z 
2025-05-07T20:23:13.3409589Z   Version 2023.7.20250428:
2025-05-07T20:23:13.3409891Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:13.3410137Z 
2025-05-07T20:23:13.3410465Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:13.3410679Z 
2025-05-07T20:23:13.3410764Z     Release notes:
2025-05-07T20:23:13.3411150Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:13.3411506Z 
2025-05-07T20:23:13.3411626Z ================================================================================
2025-05-07T20:23:13.4578177Z Dependencies resolved.
2025-05-07T20:23:13.4867899Z ================================================================================
2025-05-07T20:23:13.4868320Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:13.4868706Z ================================================================================
2025-05-07T20:23:13.4869001Z Upgrading:
2025-05-07T20:23:13.4869358Z  nvidia-container-toolkit       x86_64  1.17.6-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:13.4869937Z  nvidia-container-toolkit-base  x86_64  1.17.6-1  nvidia-container-toolkit  5.7 M
2025-05-07T20:23:13.4870291Z 
2025-05-07T20:23:13.4870610Z Transaction Summary
2025-05-07T20:23:13.4870868Z ================================================================================
2025-05-07T20:23:13.4871173Z Upgrade  2 Packages
2025-05-07T20:23:13.4871309Z 
2025-05-07T20:23:13.4871753Z Total download size: 6.9 M
2025-05-07T20:23:13.4872652Z Downloading Packages:
2025-05-07T20:23:13.5288521Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64  31 MB/s | 1.2 MB  00:00
2025-05-07T20:23:13.5753326Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x  65 MB/s | 5.7 MB  00:00
2025-05-07T20:23:13.5761265Z --------------------------------------------------------------------------------
2025-05-07T20:23:13.5764445Z Total                                            78 MB/s | 6.9 MB  00:00
2025-05-07T20:23:13.5767008Z Running transaction check
2025-05-07T20:23:13.5866190Z Transaction check succeeded.
2025-05-07T20:23:13.5866654Z Running transaction test
2025-05-07T20:23:13.6159403Z Transaction test succeeded.
2025-05-07T20:23:13.6163016Z Running transaction
2025-05-07T20:23:14.1663278Z   Preparing        :                                                      1/1
2025-05-07T20:23:14.2728496Z   Upgrading        : nvidia-container-toolkit-base-1.17.6-1.x86_64        1/4
2025-05-07T20:23:14.2757378Z   Upgrading        : nvidia-container-toolkit-1.17.6-1.x86_64             2/4
2025-05-07T20:23:14.2948957Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64             2/4
2025-05-07T20:23:14.2949731Z   Cleanup          : nvidia-container-toolkit-1.16.2-1.x86_64             3/4
2025-05-07T20:23:14.3062569Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64             3/4
2025-05-07T20:23:14.3092244Z   Cleanup          : nvidia-container-toolkit-base-1.16.2-1.x86_64        4/4
2025-05-07T20:23:14.4525140Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64             4/4
2025-05-07T20:23:14.4526264Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64             1/4
2025-05-07T20:23:14.4527345Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64             2/4
2025-05-07T20:23:14.4528373Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64        3/4
2025-05-07T20:23:14.6639701Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64        4/4
2025-05-07T20:23:14.6640046Z 
2025-05-07T20:23:14.6640129Z Upgraded:
2025-05-07T20:23:14.6640474Z   nvidia-container-toolkit-1.17.6-1.x86_64
2025-05-07T20:23:14.6641028Z   nvidia-container-toolkit-base-1.17.6-1.x86_64
2025-05-07T20:23:14.6641359Z 
2025-05-07T20:23:14.6641437Z Complete!
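Both yum invocations in this job carry the "[EXEC] [ATTEMPT 0/3]" prefix, which indicates setup_env.bash wraps package operations in a retry helper after probing for apt-get and falling back to /usr/bin/yum. A rough sketch of that pattern, with the helper name assumed rather than taken from the script:

    exec_with_retries () {
      # Hypothetical retry helper: try a command up to four times (attempts 0-3),
      # logging each attempt in the "[EXEC] [ATTEMPT n/3]" style seen above.
      local max_retries=3
      local attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        if "$@"; then
          return 0
        fi
        sleep 2
      done
      return 1
    }

    # The update and install steps above then reduce to:
    exec_with_retries sudo yum update -y
    exec_with_retries sudo yum install -y hostname lshw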
2025-05-07T20:23:14.7105851Z [INSTALL] Installing system package(s): hostname lshw ...
2025-05-07T20:23:14.7130582Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw
2025-05-07T20:23:15.2300629Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025.
2025-05-07T20:23:15.2539208Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:15.2943942Z Dependencies resolved.
2025-05-07T20:23:15.3120409Z ================================================================================
2025-05-07T20:23:15.3120878Z  Package   Architecture   Version                   Repository    Size
2025-05-07T20:23:15.3121292Z ================================================================================
2025-05-07T20:23:15.3121594Z Installing:
2025-05-07T20:23:15.3121889Z  lshw      x86_64         B.02.19.2-7.amzn2023.0.3  amazonlinux   319 k
2025-05-07T20:23:15.3122152Z 
2025-05-07T20:23:15.3122253Z Transaction Summary
2025-05-07T20:23:15.3122498Z ================================================================================
2025-05-07T20:23:15.3122800Z Install  1 Package
2025-05-07T20:23:15.3122935Z 
2025-05-07T20:23:15.3123057Z Total download size: 319 k
2025-05-07T20:23:15.3123848Z Installed size: 837 k
2025-05-07T20:23:15.3124848Z Downloading Packages:
2025-05-07T20:23:15.3901497Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm         6.6 MB/s | 319 kB  00:00
2025-05-07T20:23:15.3906988Z --------------------------------------------------------------------------------
2025-05-07T20:23:15.3909745Z Total                                            4.0 MB/s | 319 kB  00:00
2025-05-07T20:23:15.4067454Z Running transaction check
2025-05-07T20:23:15.4122849Z Transaction check succeeded.
2025-05-07T20:23:15.4123408Z Running transaction test
2025-05-07T20:23:15.4582760Z Transaction test succeeded.
2025-05-07T20:23:15.4586543Z Running transaction
2025-05-07T20:23:15.5641565Z   Preparing        :                                                      1/1
2025-05-07T20:23:15.6179701Z   Installing       : lshw-B.02.19.2-7.amzn2023.0.3.x86_64                 1/1
2025-05-07T20:23:15.8295635Z   Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64                 1/1
2025-05-07T20:23:15.9861344Z   Verifying        : lshw-B.02.19.2-7.amzn2023.0.3.x86_64                 1/1
2025-05-07T20:23:15.9861664Z 
2025-05-07T20:23:15.9861757Z Installed:
2025-05-07T20:23:15.9862063Z   lshw-B.02.19.2-7.amzn2023.0.3.x86_64
2025-05-07T20:23:15.9862354Z 
2025-05-07T20:23:15.9862436Z Complete!
2025-05-07T20:23:16.0332268Z + hostname
2025-05-07T20:23:16.0332461Z 
2025-05-07T20:23:16.0346965Z ip-10-0-1-116.ec2.internal
2025-05-07T20:23:16.0348523Z 
2025-05-07T20:23:16.0349166Z + sudo lshw -C display
2025-05-07T20:23:16.0349375Z 
2025-05-07T20:23:16.4780672Z   *-display:0 UNCLAIMED
2025-05-07T20:23:16.4781080Z        description: VGA compatible controller
2025-05-07T20:23:16.4781413Z        product: Amazon.com, Inc.
2025-05-07T20:23:16.4781688Z        vendor: Amazon.com, Inc.
2025-05-07T20:23:16.4781938Z        physical id: 3
2025-05-07T20:23:16.4782175Z        bus info: pci@0000:00:03.0
2025-05-07T20:23:16.4782431Z        version: 00
2025-05-07T20:23:16.4782635Z        width: 32 bits
2025-05-07T20:23:16.4782854Z        clock: 33MHz
2025-05-07T20:23:16.4783098Z        capabilities: vga_controller bus_master
2025-05-07T20:23:16.4783415Z        configuration: latency=0
2025-05-07T20:23:16.4783730Z        resources: memory:c1000000-c13fffff memory:c0000-dffff
2025-05-07T20:23:16.4784055Z   *-display:1
2025-05-07T20:23:16.4784283Z        description: 3D controller
2025-05-07T20:23:16.4784590Z        product: GA102GL [A10G]
2025-05-07T20:23:16.4784855Z        vendor: NVIDIA Corporation
2025-05-07T20:23:16.4785119Z        physical id: 1e
2025-05-07T20:23:16.4785350Z        bus info: pci@0000:00:1e.0
2025-05-07T20:23:16.4785600Z        version: a1
2025-05-07T20:23:16.4785808Z        width: 64 bits
2025-05-07T20:23:16.4786018Z        clock: 33MHz
2025-05-07T20:23:16.4786310Z        capabilities: pm pciexpress msix bus_master cap_list
2025-05-07T20:23:16.4786676Z        configuration: driver=nvidia latency=0
2025-05-07T20:23:16.4787282Z        resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff
2025-05-07T20:23:16.4819221Z 
2025-05-07T20:23:16.4819555Z ################################################################################
2025-05-07T20:23:16.4819989Z [INFO] Printing NVIDIA GPU info ...
2025-05-07T20:23:16.4948563Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:16.5115045Z Wed May 7 20:23:16 2025
2025-05-07T20:23:16.5115437Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:16.5115927Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:16.5117818Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:16.5118290Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:16.5118807Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:16.5119229Z |                                         |                        |               MIG M. |
2025-05-07T20:23:16.5119560Z |=========================================+========================+======================|
2025-05-07T20:23:16.5195193Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:16.5195862Z |  0%   31C    P0             60W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:16.5196239Z |                                         |                        |                  N/A |
2025-05-07T20:23:16.5196620Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:16.5197015Z 
2025-05-07T20:23:16.5197400Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:16.5197818Z | Processes:                                                                              |
2025-05-07T20:23:16.5198247Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:16.5198653Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:16.5199001Z |=========================================================================================|
2025-05-07T20:23:16.5199997Z |  No running processes found                                                             |
2025-05-07T20:23:16.5200456Z +-----------------------------------------------------------------------------------------+
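ENFORCE_CUDA_DEVICE=1 is set in this job's environment, and the nvidia-smi listing above is the signal it gates on. A plausible sketch of the enforcement, assuming it only requires that nvidia-smi can enumerate at least one device (the actual check lives in .github/scripts/setup_env.bash and may differ):

    if [ "${ENFORCE_CUDA_DEVICE:-0}" = "1" ]; then
      # `nvidia-smi -L` lists visible GPUs and exits non-zero when none are found.
      if ! nvidia-smi -L > /dev/null 2>&1; then
        echo "[CHECK] ENFORCE_CUDA_DEVICE=1 but no CUDA device is visible; failing the job."
        exit 1
      fi
    fi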
2025-05-07T20:23:16.6750184Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.6751226Z [CHECK] rocminfo not found 2025-05-07T20:23:16.6759962Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:16.6761438Z [CHECK] rocm-smi not found 2025-05-07T20:23:16.6820428Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.6820867Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:16.6832584Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:16.6832938Z env: 2025-05-07T20:23:16.6833163Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:16.6833464Z BUILD_ENV: build_binary 2025-05-07T20:23:16.6833701Z BUILD_TARGET: genai 2025-05-07T20:23:16.6833912Z BUILD_VARIANT: cuda 2025-05-07T20:23:16.6834133Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:16.6834380Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:16.6834668Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:16.6834996Z ##[endgroup] 2025-05-07T20:23:17.0189202Z ################################################################################ 2025-05-07T20:23:17.0189576Z # Setup Miniconda 2025-05-07T20:23:17.0189794Z # 2025-05-07T20:23:17.0206913Z # [2025-05-07T20:23:17.020Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:17.0207462Z ################################################################################ 2025-05-07T20:23:17.0207679Z 2025-05-07T20:23:17.0222566Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:17.1129824Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:17.1130320Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:17.1130516Z 2025-05-07T20:23:17.1147385Z 2025-05-07T20:23:17.1147595Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:17.1170180Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:18.4668936Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:18.4669333Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:18.4669597Z 2025-05-07T20:23:18.4815153Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:18.9278176Z Unpacking payload ... 2025-05-07T20:23:19.4455360Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:20.2490407Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:22.3487497Z 2025-05-07T20:23:22.3487909Z Installing base environment... 2025-05-07T20:23:22.3488162Z 2025-05-07T20:23:23.4274268Z Preparing transaction: ...working... done 2025-05-07T20:23:26.4329981Z Executing transaction: ...working... done 2025-05-07T20:23:27.0905534Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:27.1785434Z installation finished. 2025-05-07T20:23:27.1793682Z 2025-05-07T20:23:27.1794583Z + rm -f miniconda.sh 2025-05-07T20:23:27.1794850Z 2025-05-07T20:23:27.2099515Z 2025-05-07T20:23:27.2099943Z [SETUP] Reloading the bash configuration ... 
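(The reload announced above is needed because conda init only edits ~/.bashrc; the script then re-sources it so that conda commands work in the same non-interactive shell. An equivalent that avoids editing ~/.bashrc at all is to source conda's hook script directly; a sketch, using the profile.d path that conda init lists below:)

    # Sketch: enable `conda activate` without conda init / .bashrc edits
    source /home/ec2-user/miniconda/etc/profile.d/conda.sh
    conda activate base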
2025-05-07T20:23:27.2100309Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.5741470Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.5742042Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.5742576Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.5743100Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.5743636Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.5744219Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.5744859Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.5745513Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.5746196Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.5747344Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.5747916Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.5748284Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:27.5748676Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.6397515Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.4737930Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.4761518Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:42.0884696Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.6691929Z Solving environment: done
2025-05-07T20:23:43.7659989Z ## Package Plan ##
2025-05-07T20:23:43.7660305Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.7660675Z added / updated specs:
2025-05-07T20:23:43.7660944Z - conda-libmamba-solver
2025-05-07T20:23:43.7661215Z - libarchive
2025-05-07T20:23:43.7661425Z - libmamba
2025-05-07T20:23:43.7661635Z - libmambapy
2025-05-07T20:23:43.7661915Z The following packages will be downloaded:
2025-05-07T20:23:43.7662248Z package | build
2025-05-07T20:23:43.7662572Z ---------------------------|-----------------
2025-05-07T20:23:43.7662991Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:23:43.7663471Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:23:43.7663895Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:23:43.7664370Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:23:43.7664818Z ------------------------------------------------------------
2025-05-07T20:23:43.7665160Z Total: 1.4 MB
2025-05-07T20:23:43.7665493Z The following packages will be UPDATED:
2025-05-07T20:23:43.7670637Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.7671413Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.7672001Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.7672634Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.7673420Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.7674052Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:43.8338197Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:23:43.8752686Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:23:43.9676601Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:23:43.9682715Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:23:43.9684369Z done
2025-05-07T20:23:44.0691471Z Preparing transaction: done
2025-05-07T20:23:44.1694718Z Verifying transaction: done
2025-05-07T20:23:45.4712400Z Executing transaction: done
2025-05-07T20:23:47.2062312Z [SETUP] Updating Miniconda base packages ...
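(A note on the [EXEC] [ATTEMPT 0/3] prefix that recurs throughout this log: it comes from a retry wrapper in setup_env.bash that re-runs flaky network-bound commands. The wrapper itself is not shown in the log; a minimal sketch of the pattern, with the function name, backoff, and attempt count as assumptions, is below, with usage matching the command that follows.)

    # Sketch of a retry wrapper: run a command, retrying up to 3 times
    exec_with_retries() {
      local attempt
      for attempt in 0 1 2; do
        echo "[EXEC] [ATTEMPT ${attempt}/3] + $*"
        "$@" && return 0
        sleep 10  # brief pause before retrying
      done
      return 1
    }

    exec_with_retries conda update -n base -c defaults --update-deps -y conda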
2025-05-07T20:23:47.2087130Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:48.1184528Z Channels:
2025-05-07T20:23:48.1184789Z - defaults
2025-05-07T20:23:48.1184997Z Platform: linux-64
2025-05-07T20:23:49.3488445Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.4662821Z Solving environment: done
2025-05-07T20:23:49.4663217Z Channels:
2025-05-07T20:23:49.4663217Z - defaults
2025-05-07T20:23:49.4663453Z Platform: linux-64
2025-05-07T20:23:49.7609890Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.9765880Z Solving environment: done
2025-05-07T20:23:50.1255782Z ## Package Plan ##
2025-05-07T20:23:50.1256202Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:50.1256666Z added / updated specs:
2025-05-07T20:23:50.1256965Z - conda
2025-05-07T20:23:50.1257207Z The following packages will be downloaded:
2025-05-07T20:23:50.1257540Z package | build
2025-05-07T20:23:50.1257860Z ---------------------------|-----------------
2025-05-07T20:23:50.1258198Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:23:50.1258825Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:23:50.1259317Z ------------------------------------------------------------
2025-05-07T20:23:50.1259879Z Total: 1.4 MB
2025-05-07T20:23:50.1260333Z The following packages will be UPDATED:
2025-05-07T20:23:50.1260956Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:50.1261458Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:50.1261856Z Downloading and Extracting Packages: ...working...
2025-05-07T20:23:50.2260009Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:23:50.3998720Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:23:50.3999470Z done
2025-05-07T20:23:50.5002512Z Preparing transaction: done
2025-05-07T20:23:50.6008991Z Verifying transaction: done
2025-05-07T20:23:52.7039582Z Executing transaction: done
2025-05-07T20:23:53.3154193Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.3155670Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.3490738Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.3491161Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.4219995Z + conda clean --all -y
2025-05-07T20:23:54.9770792Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.9771166Z Will remove 1 index cache(s).
2025-05-07T20:23:54.9771450Z There are no unused package(s) to remove.
2025-05-07T20:23:54.9771755Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.9772045Z There are no logfile(s) to remove. 2025-05-07T20:23:55.0405475Z 2025-05-07T20:23:55.0410322Z + conda info 2025-05-07T20:23:55.0410497Z 2025-05-07T20:23:55.8195500Z 2025-05-07T20:23:55.8195969Z active environment : base 2025-05-07T20:23:55.8196324Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.8196700Z shell level : 1 2025-05-07T20:23:55.8196989Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.8197365Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.8197748Z conda version : 25.3.1 2025-05-07T20:23:55.8198029Z conda-build version : not installed 2025-05-07T20:23:55.8198327Z python version : 3.13.2.final.0 2025-05-07T20:23:55.8198620Z solver : libmamba (default) 2025-05-07T20:23:55.8198931Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.8199225Z __conda=25.3.1=0 2025-05-07T20:23:55.8199501Z __cuda=12.8=0 2025-05-07T20:23:55.8199774Z __glibc=2.34=0 2025-05-07T20:23:55.8200051Z __linux=6.1.130=0 2025-05-07T20:23:55.8200316Z __unix=0=0 2025-05-07T20:23:55.8200646Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.8201404Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.8201748Z conda av metadata url : None 2025-05-07T20:23:55.8202117Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.8202550Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.8202929Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.8203295Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.8203660Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.8203999Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.8204330Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.8204661Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.8204959Z platform : linux-64 2025-05-07T20:23:55.8205790Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.8206597Z UID:GID : 1000:1000 2025-05-07T20:23:55.8207013Z netrc file : None 2025-05-07T20:23:55.8207274Z offline mode : False 2025-05-07T20:23:55.8207438Z 2025-05-07T20:23:55.8872431Z 2025-05-07T20:23:55.8872706Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.8873416Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_4d378ef6-9297-48d4-9fb0-05cc395e54c6 ... 2025-05-07T20:23:55.8874203Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.8946298Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.10 2025-05-07T20:23:55.8946787Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.10 2025-05-07T20:23:55.8965592Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.8965943Z env: 2025-05-07T20:23:55.8966158Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.8966456Z BUILD_ENV: build_binary 2025-05-07T20:23:55.8966695Z BUILD_TARGET: genai 2025-05-07T20:23:55.8966930Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.8967157Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.8967407Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.8967708Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.8968026Z ##[endgroup] 2025-05-07T20:23:56.2345751Z ################################################################################ 2025-05-07T20:23:56.2346198Z # Create Conda Environment 2025-05-07T20:23:56.2346441Z # 2025-05-07T20:23:56.2360702Z # [2025-05-07T20:23:56.235Z] + create_conda_environment build_binary 3.10 2025-05-07T20:23:56.2361117Z ################################################################################ 2025-05-07T20:23:56.2361331Z 2025-05-07T20:23:56.2375574Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:56.3291604Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:56.3292158Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:56.3292620Z + conda info --envs 2025-05-07T20:23:56.3292850Z 2025-05-07T20:23:57.0944041Z 2025-05-07T20:23:57.0944291Z # conda environments: 2025-05-07T20:23:57.0944537Z # 2025-05-07T20:23:57.0952324Z base /home/ec2-user/miniconda 2025-05-07T20:23:57.0952560Z 2025-05-07T20:23:57.1615753Z 2025-05-07T20:23:57.1616249Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.7957421Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.7957942Z 2025-05-07T20:23:58.7974293Z 2025-05-07T20:23:58.7983780Z [SETUP] Creating new Conda environment (Python 3.10) ... 
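(Note the rm -rf of the prefix directory just above: wiping the directory itself, rather than trusting conda env remove, guarantees a clean slate even if an earlier run left a half-created environment behind. Boiled down, the recreate-from-scratch step is just the following, a sketch consolidating the commands this log runs.)

    # Sketch: idempotent recreation of the build environment with a pinned interpreter
    rm -rf "$HOME/miniconda/envs/build_binary"
    conda create -y -n build_binary python=3.10
    # CI steps invoke tools via `conda run` instead of activating the env
    conda run -n build_binary python --version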
2025-05-07T20:23:58.8005332Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.10
2025-05-07T20:23:59.5501546Z Channels:
2025-05-07T20:23:59.5502060Z - defaults
2025-05-07T20:23:59.5502296Z Platform: linux-64
2025-05-07T20:24:01.0898967Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:01.1903964Z Solving environment: done
2025-05-07T20:24:01.2191593Z ## Package Plan ##
2025-05-07T20:24:01.2191972Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:01.2192384Z added / updated specs:
2025-05-07T20:24:01.2192638Z - python=3.10
2025-05-07T20:24:01.2192912Z The following packages will be downloaded:
2025-05-07T20:24:01.2193260Z package | build
2025-05-07T20:24:01.2193583Z ---------------------------|-----------------
2025-05-07T20:24:01.2193943Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:01.2194330Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:01.2194744Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:01.2195153Z python-3.10.16 | he870216_1 26.9 MB
2025-05-07T20:24:01.2195543Z setuptools-78.1.1 | py310h06a4308_0 1.7 MB
2025-05-07T20:24:01.2196255Z wheel-0.45.1 | py310h06a4308_0 115 KB
2025-05-07T20:24:01.2196618Z ------------------------------------------------------------
2025-05-07T20:24:01.2196950Z Total: 28.8 MB
2025-05-07T20:24:01.2197280Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:01.2197903Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:01.2198349Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:01.2198767Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:01.2199237Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:01.2199772Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:01.2200231Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:01.2200671Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.2201095Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:01.2201547Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.2201994Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:01.2202409Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:01.2202815Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:01.2203216Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:01.2203611Z python pkgs/main/linux-64::python-3.10.16-he870216_1
2025-05-07T20:24:01.2204033Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:01.2204493Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py310h06a4308_0
2025-05-07T20:24:01.2204961Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:01.2205345Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:01.2205716Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:01.2206130Z wheel pkgs/main/linux-64::wheel-0.45.1-py310h06a4308_0
2025-05-07T20:24:01.2206519Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:01.2206891Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:01.2207277Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:01.2766817Z wheel-0.45.1 | 115 KB | ########## | 100%
2025-05-07T20:24:01.2885181Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:01.3033287Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:01.4193680Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:01.7712765Z python-3.10.16 | 26.9 MB | ########## | 100%
2025-05-07T20:24:02.2540070Z setuptools-78.1.1 | 1.7 MB | ########## | 100%
2025-05-07T20:24:02.2548208Z done
2025-05-07T20:24:02.4654400Z Preparing transaction: done
2025-05-07T20:24:03.6274999Z Verifying transaction: done
2025-05-07T20:24:05.9491300Z Executing transaction: done
2025-05-07T20:24:05.9994258Z #
2025-05-07T20:24:05.9994511Z # To activate this environment, use
2025-05-07T20:24:05.9994792Z #
2025-05-07T20:24:05.9994999Z # $ conda activate build_binary
2025-05-07T20:24:05.9995577Z #
2025-05-07T20:24:05.9995798Z # To deactivate an active environment, use
2025-05-07T20:24:05.9996085Z #
2025-05-07T20:24:05.9996274Z # $ conda deactivate
2025-05-07T20:24:06.1090045Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:06.1112069Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:09.0546205Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (25.1)
2025-05-07T20:24:09.0547454Z Collecting pip
2025-05-07T20:24:09.0547791Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:09.0548209Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:09.0549046Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 55.6 MB/s eta 0:00:00
2025-05-07T20:24:09.0549407Z Installing collected packages: pip
2025-05-07T20:24:09.0549696Z Attempting uninstall: pip
2025-05-07T20:24:09.0549984Z Found existing installation: pip 25.1
2025-05-07T20:24:09.0550312Z Uninstalling pip-25.1:
2025-05-07T20:24:09.0550585Z Successfully uninstalled pip-25.1
2025-05-07T20:24:09.0550896Z Successfully installed pip-25.1.1
2025-05-07T20:24:09.1182449Z [SETUP] Upgrading pyOpenSSL ...
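(The pin pyOpenSSL>22.1.0 matters because older pyOpenSSL releases are incompatible with the cryptography versions conda-forge now ships; the solve below pulls in cryptography 44.0.3 alongside it. One reproduction detail, sketched here: the > in the spec must be quoted so the shell does not treat it as a redirect.)

    # Sketch: quote the version spec so '>' reaches conda, not the shell
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"
    # Sanity-check the import, as the script does after installing
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"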
2025-05-07T20:24:09.1204874Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:09.9783999Z Channels:
2025-05-07T20:24:09.9784330Z - conda-forge
2025-05-07T20:24:09.9784594Z Platform: linux-64
2025-05-07T20:24:20.4777208Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:22.1049847Z Solving environment: done
2025-05-07T20:24:22.1650273Z ## Package Plan ##
2025-05-07T20:24:22.1650708Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:22.1651150Z added / updated specs:
2025-05-07T20:24:22.1651413Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:22.1651738Z The following packages will be downloaded:
2025-05-07T20:24:22.1652070Z package | build
2025-05-07T20:24:22.1652394Z ---------------------------|-----------------
2025-05-07T20:24:22.1652771Z cffi-1.17.1 | py310h8deb56e_0 238 KB conda-forge
2025-05-07T20:24:22.1653216Z cryptography-44.0.3 | py310h6c63255_0 1.5 MB conda-forge
2025-05-07T20:24:22.1653663Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:22.1654074Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:22.1654490Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:22.1654901Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:22.1655329Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:22.1655760Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:22.1656188Z python_abi-3.10 | 2_cp310 4 KB conda-forge
2025-05-07T20:24:22.1656643Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:22.1657132Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:22.1657551Z ------------------------------------------------------------
2025-05-07T20:24:22.1657899Z Total: 6.3 MB
2025-05-07T20:24:22.1658252Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:22.1658672Z cffi conda-forge/linux-64::cffi-1.17.1-py310h8deb56e_0
2025-05-07T20:24:22.1659484Z cryptography conda-forge/linux-64::cryptography-44.0.3-py310h6c63255_0
2025-05-07T20:24:22.1660059Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:22.1660511Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:22.1660976Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:22.1661433Z python_abi conda-forge/linux-64::python_abi-3.10-2_cp310
2025-05-07T20:24:22.1662245Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:22.1662832Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:22.1663274Z The following packages will be UPDATED:
2025-05-07T20:24:22.1663863Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:22.1664609Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:22.1665252Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:22.1665868Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:22.1666378Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:22.2654982Z cffi-1.17.1 | 238 KB | ########## | 100%
2025-05-07T20:24:22.3009277Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:24:22.3108693Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:24:22.3211546Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:24:22.3387403Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:24:22.3519594Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:24:22.3714101Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:24:22.3777504Z typing_extensions-4. | 51 KB | ########## | 100%
2025-05-07T20:24:22.3896903Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:24:22.3912746Z python_abi-3.10 | 4 KB | ########## | 100%
2025-05-07T20:24:22.4077213Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:24:22.7061550Z done
2025-05-07T20:24:22.8062333Z Preparing transaction: done
2025-05-07T20:24:22.9067516Z Verifying transaction: done
2025-05-07T20:24:24.4092299Z Executing transaction: done
2025-05-07T20:24:24.5889415Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:26.3309730Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:26.3322876Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:26.3346050Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:27.1964568Z Channels:
2025-05-07T20:24:27.1964809Z - conda-forge
2025-05-07T20:24:27.1965035Z Platform: linux-64
2025-05-07T20:24:30.5066142Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.8740905Z Solving environment: done
2025-05-07T20:24:30.9347220Z ## Package Plan ##
2025-05-07T20:24:30.9347614Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.9348024Z added / updated specs:
2025-05-07T20:24:30.9348272Z - libxcrypt
2025-05-07T20:24:30.9348537Z The following packages will be downloaded:
2025-05-07T20:24:30.9348869Z package | build
2025-05-07T20:24:30.9349208Z ---------------------------|-----------------
2025-05-07T20:24:30.9349589Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:30.9349992Z ------------------------------------------------------------
2025-05-07T20:24:30.9350330Z Total: 98 KB
2025-05-07T20:24:30.9350677Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:30.9351130Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:30.9351576Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:31.0848555Z libxcrypt-4.4.36 | 98 KB | ########## | 100%
2025-05-07T20:24:31.0852340Z done
2025-05-07T20:24:31.1857556Z Preparing transaction: done
2025-05-07T20:24:31.2861805Z Verifying transaction: done
2025-05-07T20:24:31.3867020Z Executing transaction: done
2025-05-07T20:24:34.8339173Z [SETUP] Copying over crypt.h ...
2025-05-07T20:24:34.8339981Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.10/crypt.h
2025-05-07T20:24:36.4692656Z [SETUP] Installed Python version: Python 3.10.16
2025-05-07T20:24:36.4693122Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:24:36.4726989Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:36.4727473Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:36.4742166Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:36.4742514Z env:
2025-05-07T20:24:36.4742740Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:36.4743036Z BUILD_ENV: build_binary
2025-05-07T20:24:36.4743276Z BUILD_TARGET: genai
2025-05-07T20:24:36.4743506Z BUILD_VARIANT: cuda
2025-05-07T20:24:36.4743740Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:36.4743994Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:36.4744294Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:36.4744612Z ##[endgroup]
2025-05-07T20:24:36.8180631Z ################################################################################
2025-05-07T20:24:36.8180994Z # Install C/C++ Compilers
2025-05-07T20:24:36.8181228Z #
2025-05-07T20:24:36.8198254Z # [2025-05-07T20:24:36.819Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:36.8198686Z ################################################################################
2025-05-07T20:24:36.8216006Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:36.9188371Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:36.9199457Z [INSTALL] Installing GLIBC (architecture = 64) ...
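(Pinning sysroot_linux-64=2.17 makes the toolchain compile and link against glibc 2.17 headers and stubs even though the AL2023 host runs glibc 2.34, per the __glibc virtual package in the conda info output above. That floor is what keeps the resulting binaries loadable on older, manylinux2014-era systems. The step below reduces to this sketch:)

    # Sketch: pin an old sysroot so built artifacts don't require the host's newer glibc
    conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17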
2025-05-07T20:24:36.9222807Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:37.7899093Z Channels:
2025-05-07T20:24:37.7899354Z - conda-forge
2025-05-07T20:24:37.7899598Z Platform: linux-64
2025-05-07T20:24:41.1291177Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:41.4961487Z Solving environment: done
2025-05-07T20:24:41.5577054Z ## Package Plan ##
2025-05-07T20:24:41.5577465Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:41.5577892Z added / updated specs:
2025-05-07T20:24:41.5578182Z - sysroot_linux-64=2.17
2025-05-07T20:24:41.5578500Z The following packages will be downloaded:
2025-05-07T20:24:41.5578851Z package | build
2025-05-07T20:24:41.5579175Z ---------------------------|-----------------
2025-05-07T20:24:41.5579604Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:24:41.5580401Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:24:41.5581008Z ------------------------------------------------------------
2025-05-07T20:24:41.5581510Z Total: 15.4 MB
2025-05-07T20:24:41.5582009Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:41.5582697Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:41.5583265Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:41.5583727Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:41.8820374Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:24:42.0646092Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:24:42.5405689Z done
2025-05-07T20:24:42.6408795Z Preparing transaction: done
2025-05-07T20:24:42.8415329Z Verifying transaction: done
2025-05-07T20:24:43.0485938Z Executing transaction: done
2025-05-07T20:24:43.2043525Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:43.2043890Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:44.8886498Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:24:44.8900184Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
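(conda-forge's gxx_linux-64 metapackage installs a cross-prefixed GCC/G++ whose binaries carry an x86_64-conda-linux-gnu- triplet, and its activation scripts export CC and CXX, so later build steps pick up GCC 11.4.0 rather than the system compiler. A quick way to confirm what a build inside the env will see is sketched below; the triplet naming is the conda-forge convention, not something printed in this log.)

    # Sketch: show the compiler the activated environment exposes
    conda run -n build_binary bash -c 'echo "CC=$CC CXX=$CXX"; $CXX --version | head -1'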
2025-05-07T20:24:44.8924035Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:45.7765260Z Channels:
2025-05-07T20:24:45.7765874Z - conda-forge
2025-05-07T20:24:45.7766391Z Platform: linux-64
2025-05-07T20:24:49.0827420Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:50.0377440Z Solving environment: done
2025-05-07T20:24:50.1021060Z ## Package Plan ##
2025-05-07T20:24:50.1021567Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:50.1022164Z added / updated specs:
2025-05-07T20:24:50.1022422Z - gxx_linux-64=11.4.0
2025-05-07T20:24:50.1022780Z The following packages will be downloaded:
2025-05-07T20:24:50.1023121Z package | build
2025-05-07T20:24:50.1023460Z ---------------------------|-----------------
2025-05-07T20:24:50.1023993Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:24:50.1024473Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:24:50.1024942Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:24:50.1025392Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:24:50.1025829Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:24:50.1026380Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:24:50.1026816Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:24:50.1027290Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:24:50.1027758Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:24:50.1028323Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:24:50.1028795Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:24:50.1029264Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:24:50.1029666Z ------------------------------------------------------------
2025-05-07T20:24:50.1030009Z Total: 91.6 MB
2025-05-07T20:24:50.1030352Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:50.1030833Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:24:50.1031710Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:24:50.1032256Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:24:50.1033065Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:24:50.1033562Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:24:50.1034192Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:24:50.1034713Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:50.1035270Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:24:50.1035752Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:24:50.1036292Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:50.1036767Z The following packages will be UPDATED:
2025-05-07T20:24:50.1037287Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:24:50.1037998Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:24:50.1038557Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:50.4214155Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:24:50.6538646Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:24:50.6722272Z libstdcxx-devel_linu | 11.1 MB | #####7 | 57%
2025-05-07T20:24:50.6820754Z gxx_impl_linux-64-11 | 11.2 MB | #########3 | 94%
2025-05-07T20:24:50.7087971Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:24:50.7218869Z libgcc-devel_linux-6 | 2.3 MB | | 1%
2025-05-07T20:24:50.7352070Z gcc_impl_linux-64-11 | 53.0 MB | #7 | 18%
2025-05-07T20:24:50.7544610Z ld_impl_linux-64-2.4 | 691 KB | 2 | 2%
libstdcxx-devel_linu | 11.1 MB | ########3 | 83%  2025-05-07T20:24:50.7944557Z 2025-05-07T20:24:50.7944561Z 2025-05-07T20:24:50.7944564Z 2025-05-07T20:24:50.7944568Z 2025-05-07T20:24:50.7944572Z 2025-05-07T20:24:50.7944576Z 2025-05-07T20:24:50.7946029Z 2025-05-07T20:24:50.8222439Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%  2025-05-07T20:24:50.8603735Z gcc_impl_linux-64-11 | 53.0 MB | ##2 | 23% 2025-05-07T20:24:50.8603974Z 2025-05-07T20:24:50.8604104Z 2025-05-07T20:24:50.8604113Z 2025-05-07T20:24:50.8604134Z 2025-05-07T20:24:50.8604200Z 2025-05-07T20:24:50.8604301Z 2025-05-07T20:24:50.8604327Z 2025-05-07T20:24:50.8607396Z 2025-05-07T20:24:50.8667571Z libstdcxx-ng-15.1.0 | 34 KB | ####7 | 47%  2025-05-07T20:24:50.8667887Z 2025-05-07T20:24:50.8667891Z 2025-05-07T20:24:50.8667895Z 2025-05-07T20:24:50.8667899Z 2025-05-07T20:24:50.8667903Z 2025-05-07T20:24:50.8667906Z 2025-05-07T20:24:50.8667920Z 2025-05-07T20:24:50.8668008Z 2025-05-07T20:24:50.8931963Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:50.8932299Z 2025-05-07T20:24:50.8932305Z 2025-05-07T20:24:50.8932310Z 2025-05-07T20:24:50.8932315Z 2025-05-07T20:24:50.8932619Z 2025-05-07T20:24:50.8934781Z 2025-05-07T20:24:50.8935319Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%  2025-05-07T20:24:50.8935636Z 2025-05-07T20:24:50.8935641Z 2025-05-07T20:24:50.8935656Z 2025-05-07T20:24:50.8935679Z 2025-05-07T20:24:50.8935685Z 2025-05-07T20:24:50.8935691Z 2025-05-07T20:24:50.9133479Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%  2025-05-07T20:24:50.9133823Z 2025-05-07T20:24:50.9133827Z 2025-05-07T20:24:50.9133831Z 2025-05-07T20:24:50.9133835Z 2025-05-07T20:24:50.9133839Z 2025-05-07T20:24:50.9133843Z 2025-05-07T20:24:50.9133846Z 2025-05-07T20:24:50.9133850Z 2025-05-07T20:24:50.9134541Z 2025-05-07T20:24:50.9144936Z gcc_linux-64-11.4.0 | 31 KB | #####2 | 52%  2025-05-07T20:24:50.9145226Z 2025-05-07T20:24:50.9145230Z 2025-05-07T20:24:50.9145234Z 2025-05-07T20:24:50.9147387Z 2025-05-07T20:24:50.9162431Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%  2025-05-07T20:24:50.9162715Z 2025-05-07T20:24:50.9162719Z 2025-05-07T20:24:50.9162723Z 2025-05-07T20:24:50.9164250Z 2025-05-07T20:24:50.9177089Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%  2025-05-07T20:24:50.9177388Z 2025-05-07T20:24:50.9177415Z 2025-05-07T20:24:50.9177419Z 2025-05-07T20:24:50.9177423Z 2025-05-07T20:24:50.9177427Z 2025-05-07T20:24:50.9177435Z 2025-05-07T20:24:50.9177439Z 2025-05-07T20:24:50.9177443Z 2025-05-07T20:24:50.9180469Z 2025-05-07T20:24:50.9224852Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:24:50.9415658Z gcc_impl_linux-64-11 | 53.0 MB | ##9 | 29% 2025-05-07T20:24:50.9415914Z 2025-05-07T20:24:50.9415918Z 2025-05-07T20:24:50.9415930Z 2025-05-07T20:24:50.9415934Z 2025-05-07T20:24:50.9415937Z 2025-05-07T20:24:50.9415941Z 2025-05-07T20:24:50.9415945Z 2025-05-07T20:24:50.9415950Z 2025-05-07T20:24:50.9415954Z 2025-05-07T20:24:50.9415957Z 2025-05-07T20:24:50.9452703Z gxx_linux-64-11.4.0 | 29 KB | #####5 | 55%  2025-05-07T20:24:50.9453149Z 2025-05-07T20:24:50.9453156Z 2025-05-07T20:24:50.9453161Z 2025-05-07T20:24:50.9453167Z 2025-05-07T20:24:50.9453172Z 2025-05-07T20:24:50.9453178Z 2025-05-07T20:24:50.9453470Z 2025-05-07T20:24:50.9453476Z 2025-05-07T20:24:50.9453479Z 2025-05-07T20:24:50.9453483Z 2025-05-07T20:24:50.9732228Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:24:50.9732546Z 2025-05-07T20:24:50.9732551Z 2025-05-07T20:24:50.9732555Z 2025-05-07T20:24:50.9732558Z 2025-05-07T20:24:50.9732562Z 
2025-05-07T20:24:50.9732566Z 2025-05-07T20:24:50.9732570Z 2025-05-07T20:24:50.9732574Z 2025-05-07T20:24:50.9732578Z 2025-05-07T20:24:50.9732581Z 2025-05-07T20:24:50.9732585Z 2025-05-07T20:24:50.9773470Z binutils_linux-64-2. | 28 KB | #####6 | 56%  2025-05-07T20:24:50.9773787Z 2025-05-07T20:24:50.9773791Z 2025-05-07T20:24:50.9773794Z 2025-05-07T20:24:50.9773798Z 2025-05-07T20:24:50.9773810Z 2025-05-07T20:24:50.9773814Z 2025-05-07T20:24:50.9773818Z 2025-05-07T20:24:50.9773821Z 2025-05-07T20:24:50.9773827Z 2025-05-07T20:24:50.9773831Z 2025-05-07T20:24:50.9776921Z 2025-05-07T20:24:51.0227913Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:24:51.0433242Z gcc_impl_linux-64-11 | 53.0 MB | ###6 | 36% 2025-05-07T20:24:51.0433593Z 2025-05-07T20:24:51.0433597Z 2025-05-07T20:24:51.0433601Z 2025-05-07T20:24:51.0433621Z 2025-05-07T20:24:51.0433624Z 2025-05-07T20:24:51.0433628Z 2025-05-07T20:24:51.0437253Z 2025-05-07T20:24:51.0453548Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%  2025-05-07T20:24:51.0453930Z 2025-05-07T20:24:51.0453934Z 2025-05-07T20:24:51.0453938Z 2025-05-07T20:24:51.0453942Z 2025-05-07T20:24:51.0453945Z 2025-05-07T20:24:51.0453949Z 2025-05-07T20:24:51.0453953Z 2025-05-07T20:24:51.0742919Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%  2025-05-07T20:24:51.0743231Z 2025-05-07T20:24:51.1073777Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:24:51.1074086Z 2025-05-07T20:24:51.1077083Z 2025-05-07T20:24:51.1229256Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:24:51.1436311Z gcc_impl_linux-64-11 | 53.0 MB | ####3 | 44% 2025-05-07T20:24:51.1436623Z 2025-05-07T20:24:51.1436629Z 2025-05-07T20:24:51.1436634Z 2025-05-07T20:24:51.1436639Z 2025-05-07T20:24:51.1436644Z 2025-05-07T20:24:51.1436649Z 2025-05-07T20:24:51.1436655Z 2025-05-07T20:24:51.1436660Z 2025-05-07T20:24:51.1441385Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:51.1441914Z 2025-05-07T20:24:51.1441918Z 2025-05-07T20:24:51.1441921Z 2025-05-07T20:24:51.1441925Z 2025-05-07T20:24:51.1441929Z 2025-05-07T20:24:51.1441932Z 2025-05-07T20:24:51.1441936Z 2025-05-07T20:24:51.1442290Z 2025-05-07T20:24:51.1920189Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:51.1920515Z 2025-05-07T20:24:51.1920520Z 2025-05-07T20:24:51.1920524Z 2025-05-07T20:24:51.1920528Z 2025-05-07T20:24:51.1920531Z 2025-05-07T20:24:51.2234861Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%  2025-05-07T20:24:51.2600220Z gcc_impl_linux-64-11 | 53.0 MB | #####3 | 54% 2025-05-07T20:24:51.2600630Z 2025-05-07T20:24:51.2600636Z 2025-05-07T20:24:51.2600642Z 2025-05-07T20:24:51.2600647Z 2025-05-07T20:24:51.2600652Z 2025-05-07T20:24:51.2600657Z 2025-05-07T20:24:51.2600663Z 2025-05-07T20:24:51.2600668Z 2025-05-07T20:24:51.2600674Z 2025-05-07T20:24:51.2607523Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:24:51.2607909Z 2025-05-07T20:24:51.2607913Z 2025-05-07T20:24:51.2607917Z 2025-05-07T20:24:51.2607921Z 2025-05-07T20:24:51.2607925Z 2025-05-07T20:24:51.2607929Z 2025-05-07T20:24:51.2607932Z 2025-05-07T20:24:51.2607936Z 2025-05-07T20:24:51.2607940Z 2025-05-07T20:24:51.3197998Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:24:51.3198427Z 2025-05-07T20:24:51.3198431Z 2025-05-07T20:24:51.3198435Z 2025-05-07T20:24:51.3198439Z 2025-05-07T20:24:51.3198726Z 2025-05-07T20:24:51.3198731Z 2025-05-07T20:24:51.3198735Z 2025-05-07T20:24:51.3198738Z 2025-05-07T20:24:51.3198742Z 2025-05-07T20:24:51.3198891Z 2025-05-07T20:24:51.3201345Z gxx_linux-64-11.4.0 | 29 KB | 
########## | 100%  2025-05-07T20:24:51.3201731Z 2025-05-07T20:24:51.3201737Z 2025-05-07T20:24:51.3201743Z 2025-05-07T20:24:51.3201748Z 2025-05-07T20:24:51.3201754Z 2025-05-07T20:24:51.3201760Z 2025-05-07T20:24:51.3201765Z 2025-05-07T20:24:51.3201770Z 2025-05-07T20:24:51.3201775Z 2025-05-07T20:24:51.3203718Z 2025-05-07T20:24:51.3235881Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:24:51.3680537Z gcc_impl_linux-64-11 | 53.0 MB | ######3 | 64% 2025-05-07T20:24:51.3680903Z 2025-05-07T20:24:51.3680908Z 2025-05-07T20:24:51.3680915Z 2025-05-07T20:24:51.3680920Z 2025-05-07T20:24:51.3680925Z 2025-05-07T20:24:51.3680929Z 2025-05-07T20:24:51.3833561Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%  2025-05-07T20:24:51.3833948Z 2025-05-07T20:24:51.3833952Z 2025-05-07T20:24:51.3833956Z 2025-05-07T20:24:51.3833969Z 2025-05-07T20:24:51.3833973Z 2025-05-07T20:24:51.3833978Z 2025-05-07T20:24:51.3833982Z 2025-05-07T20:24:51.3833985Z 2025-05-07T20:24:51.3833998Z 2025-05-07T20:24:51.3834001Z 2025-05-07T20:24:51.3835599Z 2025-05-07T20:24:51.3840836Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:24:51.3841134Z 2025-05-07T20:24:51.3841317Z 2025-05-07T20:24:51.3841345Z 2025-05-07T20:24:51.3841352Z 2025-05-07T20:24:51.3841357Z 2025-05-07T20:24:51.3841363Z 2025-05-07T20:24:51.3841369Z 2025-05-07T20:24:51.3841481Z 2025-05-07T20:24:51.3841487Z 2025-05-07T20:24:51.3841499Z 2025-05-07T20:24:51.3841569Z 2025-05-07T20:24:51.4236775Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:24:51.5236973Z gcc_impl_linux-64-11 | 53.0 MB | #######2 | 73% 2025-05-07T20:24:51.6238055Z gcc_impl_linux-64-11 | 53.0 MB | ########1 | 82% 2025-05-07T20:24:51.6969650Z gcc_impl_linux-64-11 | 53.0 MB | #########4 | 94% 2025-05-07T20:24:51.6970053Z 2025-05-07T20:24:51.6970059Z 2025-05-07T20:24:51.6970267Z 2025-05-07T20:24:51.8160054Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:24:51.8160348Z 2025-05-07T20:24:51.8313406Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:24:52.1461042Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:24:52.1461355Z 2025-05-07T20:24:52.1461360Z 2025-05-07T20:24:52.5737456Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:24:52.5743905Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:24:52.5744330Z 2025-05-07T20:24:52.5744602Z 2025-05-07T20:24:52.5744893Z  2025-05-07T20:24:52.5745174Z 2025-05-07T20:24:52.5745201Z 2025-05-07T20:24:52.5745464Z  2025-05-07T20:24:52.5745749Z 2025-05-07T20:24:52.5745768Z 2025-05-07T20:24:52.5745773Z 2025-05-07T20:24:52.5745995Z  2025-05-07T20:24:52.5746293Z 2025-05-07T20:24:52.5746298Z 2025-05-07T20:24:52.5746303Z 2025-05-07T20:24:52.5746308Z 2025-05-07T20:24:52.5746555Z  2025-05-07T20:24:52.5746859Z 2025-05-07T20:24:52.5746864Z 2025-05-07T20:24:52.5746869Z 2025-05-07T20:24:52.5746874Z 2025-05-07T20:24:52.5746879Z 2025-05-07T20:24:52.5747117Z  2025-05-07T20:24:52.5747417Z 2025-05-07T20:24:52.5747422Z 2025-05-07T20:24:52.5747427Z 2025-05-07T20:24:52.5747432Z 2025-05-07T20:24:52.5747437Z 2025-05-07T20:24:52.5747442Z 2025-05-07T20:24:52.5747705Z  2025-05-07T20:24:52.5748174Z 2025-05-07T20:24:52.5748178Z 2025-05-07T20:24:52.5748182Z 2025-05-07T20:24:52.5748186Z 2025-05-07T20:24:52.5748321Z 2025-05-07T20:24:52.5748326Z 2025-05-07T20:24:52.5748330Z 2025-05-07T20:24:52.5748546Z  2025-05-07T20:24:52.5748863Z 2025-05-07T20:24:52.5748869Z 2025-05-07T20:24:52.5748874Z 2025-05-07T20:24:52.5748879Z 2025-05-07T20:24:52.5748884Z 2025-05-07T20:24:52.5748889Z 
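[NOTE] The [EXEC] [ATTEMPT 0/3] prefix above indicates the install runs under a retry wrapper. A minimal sketch of what such a wrapper could look like is below; the function name run_with_retries and the attempt count are hypothetical illustrations, not taken from this log:

  # Hypothetical sketch: retry a command up to max_attempts+1 times,
  # echoing each attempt in the same style as the log above.
  run_with_retries () {
    local max_attempts=3
    local attempt
    for attempt in $(seq 0 "${max_attempts}"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
      if "$@"; then
        return 0  # command succeeded; stop retrying
      fi
      echo "[EXEC] Attempt ${attempt} failed; retrying ..."
    done
    return 1  # all attempts failed
  }

  # Usage, mirroring the command in this log:
  run_with_retries conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0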
2025-05-07T20:24:52.6749611Z Preparing transaction: done
2025-05-07T20:24:52.9756091Z Verifying transaction: done
2025-05-07T20:24:53.0766073Z Executing transaction: done
2025-05-07T20:24:53.2424641Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:57.1506935Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.1537639Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:57.1567804Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:57.1597868Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:59.0412759Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:59.1049737Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:00.9854065Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:01.0485249Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:02.9433445Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:03.0070235Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:04.8870958Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:04.9516424Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:04.9521071Z [INFO] Printing out all preprocessor defines in the C compiler ...
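[NOTE] The dump that follows comes from GCC's -dM -E flags: -E stops after preprocessing, and -dM prints every macro defined at that point, so feeding an empty stdin ("-" as the input file) yields exactly the compiler's built-in predefines. A minimal sketch of reproducing this check by hand, assuming the same build_binary conda env as this log:

  # Print all built-in preprocessor defines of the C compiler;
  # stdin is empty, so only the predefined macros appear.
  conda run -n build_binary cc -dM -E - </dev/null | sort

  # Filter for a single macro of interest, e.g. the GCC major version:
  conda run -n build_binary cc -dM -E - </dev/null | grep __GNUC__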
2025-05-07T20:25:04.9522207Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:04.9522780Z 2025-05-07T20:25:06.8398053Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:06.8398477Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:06.8399137Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:06.8399411Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:06.8399750Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:06.8400258Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:06.8400555Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:06.8400861Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:06.8401121Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:06.8401373Z #define __CHAR_BIT__ 8 2025-05-07T20:25:06.8401612Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:06.8401856Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:06.8402113Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:06.8402389Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:06.8402655Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:06.8402952Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8403256Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:06.8403546Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:06.8403875Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:06.8404203Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:06.8404621Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:06.8405023Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:06.8405341Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:06.8405622Z #define __GCC_IEC_559 2 2025-05-07T20:25:06.8405862Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:06.8406135Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:06.8406402Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:06.8406675Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:06.8407009Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8407335Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:06.8407611Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.8407880Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:06.8408149Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:06.8408419Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:06.8408675Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:06.8408941Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:06.8409209Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:06.8409457Z #define __INT8_C(c) c 2025-05-07T20:25:06.8409698Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:06.8409995Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8410302Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:06.8410615Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:06.8410964Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:06.8411232Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:06.8411500Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8411783Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:06.8412065Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:06.8412442Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:06.8412863Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:06.8413150Z #define __linux 1 2025-05-07T20:25:06.8413372Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:06.8413657Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:06.8413936Z #define __unix 1 2025-05-07T20:25:06.8414160Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:06.8414436Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:06.8414712Z #define __WINT_MIN__ 0U 2025-05-07T20:25:06.8414950Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.8415230Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:06.8415503Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:06.8415764Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:06.8416017Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:06.8416299Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:06.8416592Z #define __INT64_C(c) c ## L 2025-05-07T20:25:06.8416853Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:06.8417147Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:06.8417501Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:06.8417848Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:06.8418296Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:06.8418548Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:06.8418801Z #define __DBL_DIG__ 15 2025-05-07T20:25:06.8419033Z #define __FLT32_DIG__ 6 2025-05-07T20:25:06.8419335Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:06.8419677Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:06.8420009Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:06.8420335Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:06.8420675Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:06.8420918Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:06.8421179Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:06.8421553Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:06.8421939Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:06.8422216Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:06.8422471Z #define __unix__ 1 2025-05-07T20:25:06.8422685Z #define __INT_WIDTH__ 32 2025-05-07T20:25:06.8422937Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:06.8423181Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:06.8423425Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:06.8423688Z #define __UINT16_C(c) c 2025-05-07T20:25:06.8423930Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:06.8424175Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:06.8424527Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:06.8424885Z #define __gnu_linux__ 1 2025-05-07T20:25:06.8425125Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:06.8425390Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.8425681Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8425945Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:06.8426198Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:06.8426447Z #define __GNUC__ 11 2025-05-07T20:25:06.8426667Z #define __pie__ 2 2025-05-07T20:25:06.8426884Z #define __MMX__ 1 2025-05-07T20:25:06.8427110Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:06.8427377Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:06.8427645Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:06.8427915Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:06.8428258Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:06.8428644Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8428980Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:06.8429234Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:06.8429499Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:06.8429794Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:06.8430062Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:06.8430317Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:06.8430600Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:06.8440159Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:06.8440474Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:06.8440762Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:06.8441028Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:06.8441312Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:06.8441584Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:06.8441852Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:06.8442103Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:06.8442413Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:06.8442778Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:06.8443053Z #define __SSE2_MATH__ 1 2025-05-07T20:25:06.8443307Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:06.8443604Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8443903Z #define __amd64 1 2025-05-07T20:25:06.8444137Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:06.8444406Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:06.8444702Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:06.8445123Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:06.8445372Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:06.8445652Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:06.8446014Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:06.8446272Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:06.8446534Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:06.8446792Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:06.8447045Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:06.8447319Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:06.8447567Z #define __x86_64 1 2025-05-07T20:25:06.8447797Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:06.8448156Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:06.8448617Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:06.8449068Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:06.8449580Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:06.8449961Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:06.8450206Z #define __LP64__ 1 2025-05-07T20:25:06.8450446Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8450793Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:06.8451161Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:06.8451436Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:06.8451712Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.8451987Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:06.8452264Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:06.8452533Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:06.8452791Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:06.8453046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:06.8453307Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:06.8453636Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:06.8453982Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:06.8454265Z #define __FLT_DIG__ 6 2025-05-07T20:25:06.8454500Z #define __NO_INLINE__ 1 2025-05-07T20:25:06.8454737Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:06.8455069Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:06.8455421Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:06.8455672Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:06.8455934Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:06.8456193Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:06.8456444Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:06.8456705Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:06.8456997Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:06.8457285Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:06.8457550Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:06.8457857Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:06.8458183Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:06.8458443Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:06.8458708Z #define __FLT128_DIG__ 33 2025-05-07T20:25:06.8458949Z #define __INT32_C(c) c 2025-05-07T20:25:06.8459184Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:06.8459475Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:06.8459757Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:06.8460124Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:06.8460438Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:06.8460740Z #define unix 1 2025-05-07T20:25:06.8460965Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:06.8461274Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8461575Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:06.8461886Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:06.8462206Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:06.8462460Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:06.8462719Z #define __ELF__ 1 2025-05-07T20:25:06.8462944Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:06.8463329Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:06.8463609Z #define __FLT_RADIX__ 2 2025-05-07T20:25:06.8463851Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:06.8464286Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:06.8464648Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:06.8464896Z #define __SSE_MATH__ 1 2025-05-07T20:25:06.8465117Z #define __k8 1 2025-05-07T20:25:06.8465407Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:06.8465771Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:06.8466065Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:06.8466359Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:06.8466613Z #define __LDBL_DIG__ 18 2025-05-07T20:25:06.8466848Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:06.8467104Z #define __x86_64__ 1 2025-05-07T20:25:06.8467346Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:06.8467639Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:06.8467978Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8468287Z #define __FLT64_DIG__ 15 2025-05-07T20:25:06.8468563Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8468920Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:06.8469242Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8469499Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:06.8469776Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8470070Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:06.8470437Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:06.8470826Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:06.8471120Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:06.8471453Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:06.8471772Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:06.8472072Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:06.8472351Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:06.8472654Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:06.8472934Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:06.8473178Z #define __SEG_FS 1 2025-05-07T20:25:06.8473412Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:06.8473691Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:06.8473970Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8474247Z #define __SEG_GS 1 2025-05-07T20:25:06.8474557Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:06.8474936Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:06.8475210Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:06.8475492Z #define __INT16_TYPE__ short int 2025-05-07T20:25:06.8475772Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:06.8476065Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:06.8476325Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:06.8476579Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:06.8476842Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:06.8477219Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:06.8477614Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8477908Z #define linux 1 2025-05-07T20:25:06.8478130Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8478414Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:06.8478685Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:06.8478937Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:06.8479194Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:06.8479453Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:06.8479796Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:06.8480194Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:06.8480519Z #define __code_model_small__ 1 2025-05-07T20:25:06.8480792Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:06.8481065Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:06.8481312Z #define __k8__ 1 2025-05-07T20:25:06.8481635Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:06.8481924Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:06.8482224Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:06.8482547Z #define __pic__ 2 2025-05-07T20:25:06.8482798Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8483113Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:06.8483410Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8483742Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:06.8484102Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:06.8484465Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:06.8484744Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:06.8485034Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:06.8485346Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:06.8485603Z #define __linux__ 1 2025-05-07T20:25:06.8485830Z #define __INT64_TYPE__ long int 2025-05-07T20:25:06.8486101Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:06.8486371Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:06.8486643Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:06.8486902Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:06.8487209Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8487541Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:06.8487880Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:06.8488153Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:06.8488450Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:06.8488740Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:06.8489075Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:06.8489433Z #define __SSE__ 1 2025-05-07T20:25:06.8489659Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:06.8490307Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:06.8490717Z #define __amd64__ 1 2025-05-07T20:25:06.8490938Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:06.8491192Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:06.8491470Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:06.8491737Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:06.8492008Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:06.8492290Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:06.8492551Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:06.8492819Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:06.8493082Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:06.8493430Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:06.8493891Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:06.8494243Z #define _LP64 1 2025-05-07T20:25:06.8494463Z #define __UINT8_C(c) c 2025-05-07T20:25:06.8494706Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:06.8494965Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:06.8495236Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:06.8495512Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:06.8495817Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:06.8496168Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:06.8496630Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:06.8497001Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8497304Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.8497612Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:06.8497971Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:06.8498329Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:06.8498594Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:06.8498931Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:06.8499291Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:06.8499548Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:06.8499875Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:06.8500120Z #define __FXSR__ 1 2025-05-07T20:25:06.8500619Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:06.8501074Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:06.8501602Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:06.8501900Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:06.8502158Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:06.8502490Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:06.8502836Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:06.8503081Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:06.8503322Z #define __PIC__ 2 2025-05-07T20:25:06.8503567Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:06.8503959Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:06.8504340Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:06.8504666Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:06.8504993Z #define __SSE2__ 1 2025-05-07T20:25:06.8505217Z #define __INT32_TYPE__ int 2025-05-07T20:25:06.8505466Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:06.8505714Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:06.8506054Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:06.8506405Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:06.8506670Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:06.8506939Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:06.8507206Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8507473Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:06.8507720Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:06.8507967Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:06.8508249Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8508545Z #define __PIE__ 2 2025-05-07T20:25:06.8508867Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:06.8509244Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:06.8509599Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:06.8509964Z #define __INT16_C(c) c 2025-05-07T20:25:06.8510190Z #define __STDC__ 1 2025-05-07T20:25:06.8510421Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:06.8510699Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:06.8510956Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.8511251Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:06.8511596Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:06.8511925Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:06.8512189Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.8512470Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:06.8512737Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:06.8513013Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:06.8513300Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.8513572Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:06.8513862Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.8514252Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:06.8514623Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:06.8514922Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:06.8515212Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:06.8515461Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:06.8515618Z 2025-05-07T20:25:06.9032776Z 2025-05-07T20:25:06.9033106Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
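[NOTE] The C++ dump below adds -x c++ so the driver treats stdin as C++ rather than C; that is what surfaces __cplusplus and the __cpp_* feature-test macros absent from the C dump above. A sketch of the same check, plus a quick way to confirm the default language standard (the dump below shows __cplusplus 201703L, i.e. GCC 11 defaulting to C++17):

  # Dump C++ predefines; -x c++ forces the language since stdin has no file extension.
  conda run -n build_binary c++ -dM -E -x c++ - </dev/null | grep -E '__cplusplus|__cpp_'

  # Expected with no -std flag on this toolchain: #define __cplusplus 201703L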
2025-05-07T20:25:06.9033542Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:06.9033772Z 2025-05-07T20:25:08.7949659Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:08.7950255Z #define __cpp_attributes 200809L 2025-05-07T20:25:08.7950733Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:08.7951239Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:08.7951661Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:08.7952032Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:08.7952882Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:08.7953241Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:08.7953525Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:08.7953981Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:08.7954288Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:08.7954554Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:08.7954806Z #define __CHAR_BIT__ 8 2025-05-07T20:25:08.7955045Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:08.7955291Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:08.7955538Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:08.7955813Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:08.7956094Z #define __cpp_static_assert 201411L 2025-05-07T20:25:08.7956374Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:08.7956676Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.7956976Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:08.7957265Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:08.7957589Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:08.7957966Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:08.7958368Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:08.7958777Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:08.7959090Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:08.7959370Z #define __GCC_IEC_559 2 2025-05-07T20:25:08.7959608Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:08.7959884Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:08.7960161Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:08.7960441Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:08.7960734Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:08.7961054Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:08.7961363Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:08.7961690Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.7962011Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:08.7962291Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.7962561Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:08.7962839Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:08.7963145Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:08.7963403Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:08.7963666Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:08.7963943Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:08.7964265Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:08.7964598Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:08.7964855Z #define __INT8_C(c) c 2025-05-07T20:25:08.7965093Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:08.7965369Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:08.7965690Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.7966012Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:08.7966280Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:08.7966570Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:08.7966889Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:08.7967234Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:08.7967524Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:08.7967801Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.7968061Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.7968339Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:08.7968614Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:08.7968993Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:08.7969396Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:08.7969680Z #define __linux 1 2025-05-07T20:25:08.7969909Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:08.7970183Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:08.7970462Z #define __unix 1 2025-05-07T20:25:08.7970685Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:08.7970963Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:08.7971340Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:08.7971614Z #define __WINT_MIN__ 0U 2025-05-07T20:25:08.7971853Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.7972209Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:08.7972482Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:08.7972744Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:08.7972994Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:08.7973274Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:08.7973561Z #define __INT64_C(c) c ## L 2025-05-07T20:25:08.7973824Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:08.7974118Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:08.7974389Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:08.7974683Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:08.7974958Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:08.7975219Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:08.7975563Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:08.7975939Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:08.7976193Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:08.7976462Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:08.7976748Z #define __DBL_DIG__ 15 2025-05-07T20:25:08.7976978Z #define __FLT32_DIG__ 6 2025-05-07T20:25:08.7977277Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:08.7977625Z #define __GXX_WEAK__ 1 2025-05-07T20:25:08.7977860Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:08.7978101Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:08.7978420Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:08.7978763Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.7979041Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:08.7979341Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:08.7979671Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:08.7980167Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:08.7980565Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:08.7980839Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:08.7981097Z #define __unix__ 1 2025-05-07T20:25:08.7981320Z #define __INT_WIDTH__ 32 2025-05-07T20:25:08.7981574Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:08.7981821Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:08.7982068Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:08.7982333Z #define __UINT16_C(c) c 2025-05-07T20:25:08.7982568Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:08.7982819Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:08.7983180Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:08.7992283Z #define __gnu_linux__ 1 2025-05-07T20:25:08.7992559Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:08.7992838Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:08.7993133Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.7993427Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.7993693Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:08.7993972Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:08.7994228Z #define __GNUC__ 11 2025-05-07T20:25:08.7994444Z #define __GXX_RTTI 1 2025-05-07T20:25:08.7994680Z #define __pie__ 2 2025-05-07T20:25:08.7994905Z #define __MMX__ 1 2025-05-07T20:25:08.7995127Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:08.7995408Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:08.7995697Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:08.7995963Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:08.7996223Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:08.7996530Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:08.7996848Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:08.7997200Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:08.7997580Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:08.7997895Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.7998207Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:08.8000256Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:08.8000541Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:08.8000849Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:08.8001272Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:08.8001548Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:08.8001805Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:08.8002097Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:08.8002395Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:08.8002663Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:08.8002942Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:08.8003204Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:08.8003473Z #define __cplusplus 201703L 2025-05-07T20:25:08.8003747Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:08.8004035Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:08.8004294Z #define __DEPRECATED 1 2025-05-07T20:25:08.8004542Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:08.8004838Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:08.8005109Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:08.8005425Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:08.8005795Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:08.8006130Z #define __SSE2_MATH__ 1 2025-05-07T20:25:08.8006372Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:08.8006679Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.8006971Z #define __amd64 1 2025-05-07T20:25:08.8007192Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:08.8007467Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:08.8007776Z #define __GNUG__ 11 2025-05-07T20:25:08.8008042Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:08.8008361Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:08.8008628Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:08.8008896Z #define __FLT64X_MIN_EXP__ (-16381) 
[... several hundred predefined-macro lines from the compiler's `-dM -E` probe elided; notable entries: __VERSION__ "11.4.0", __GNUC_MINOR__ 4, __GNUC_PATCHLEVEL__ 0, __x86_64__ 1, __linux__ 1, __ELF__ 1, __LP64__ 1, __SSE2__ 1 ...]
2025-05-07T20:25:08.8602589Z + conda run -n build_binary c++ --version
2025-05-07T20:25:10.7478011Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:10.7478561Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:10.7479035Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:10.7479567Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:10.8105197Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:10.8105848Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:12.7594933Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:12.7597899Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:12.7598685Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:14.7069644Z #define __cplusplus 201703L
2025-05-07T20:25:14.7072935Z [INSTALL] Successfully installed C/C++ compilers
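The two probes above read the compiler's default language standards straight from its predefined macros; 201710L corresponds to C17 and 201703L to C++17. A minimal standalone sketch of the same check, assuming only a GCC-compatible cc/c++ on PATH (no conda env required):

    # Ask the compilers for their default language standards by dumping
    # predefined macros from an empty translation unit, as the log does.
    cc  -dM -E -        < /dev/null | grep __STDC_VERSION__   # 201710L -> C17
    c++ -dM -E -x c++ - < /dev/null | grep __cplusplus        # 201703L -> C++17
    # Appending a flag such as -std=c++20 to the c++ line shows how
    # build flags shift the reported default.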
2025-05-07T20:25:14.7119482Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:14.7119931Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:14.7132295Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:14.7132642Z env:
2025-05-07T20:25:14.7132862Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:14.7133162Z BUILD_ENV: build_binary
2025-05-07T20:25:14.7133402Z BUILD_TARGET: genai
2025-05-07T20:25:14.7133622Z BUILD_VARIANT: cuda
2025-05-07T20:25:14.7133852Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:14.7134107Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:14.7134397Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:14.7134722Z ##[endgroup]
2025-05-07T20:25:15.0522579Z ################################################################################
2025-05-07T20:25:15.0522966Z # Install CUDA
2025-05-07T20:25:15.0523184Z #
2025-05-07T20:25:15.0539011Z # [2025-05-07T20:25:15.053Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:15.0539467Z ################################################################################
2025-05-07T20:25:15.0556415Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:15.1433431Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:15.1433793Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:15.1438637Z + conda clean --packages --tarball -y
2025-05-07T20:25:15.8560382Z Will remove 32 (142.2 MB) tarball(s).
2025-05-07T20:25:15.8560792Z Will remove 6 (617 KB) package(s).
2025-05-07T20:25:15.9207936Z + conda clean --all -y
2025-05-07T20:25:16.5927925Z There are no unused tarball(s) to remove.
2025-05-07T20:25:16.5928344Z Will remove 1 index cache(s).
2025-05-07T20:25:16.5928668Z There are no unused package(s) to remove.
2025-05-07T20:25:16.5928984Z There are no tempfile(s) to remove.
2025-05-07T20:25:16.5929280Z There are no logfile(s) to remove.
2025-05-07T20:25:16.6589476Z [INSTALL] Installing CUDA 12.6.3 ...
2025-05-07T20:25:16.6614215Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:17.5742966Z Channels:
2025-05-07T20:25:17.5743223Z  - conda-forge
2025-05-07T20:25:17.5743452Z Platform: linux-64
2025-05-07T20:25:28.0713914Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:29.1633966Z Solving environment: done
2025-05-07T20:25:29.2373378Z ## Package Plan ##
2025-05-07T20:25:29.2373921Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:29.2374495Z added / updated specs:
2025-05-07T20:25:29.2374836Z - cuda=12.6.3
2025-05-07T20:25:29.2375196Z The following packages will be downloaded:
2025-05-07T20:25:29.2375667Z     package                    |            build
2025-05-07T20:25:29.2376051Z     ---------------------------|-----------------
[... ~120 package rows elided; the largest downloads: nsight-compute-2024.3.2.3 (443.1 MB), libcublas-12.6.4.1 (256.2 MB), libcufft-11.3.0.4 (156.2 MB), libcusparse-12.5.4.2 (118.6 MB), cuda-nsight-12.6.77 (113.2 MB), cuda-nvvp-12.6.80 (109.3 MB), libcusolver-11.7.1.2 (95.8 MB), libnpp-12.3.1.54 (93.4 MB) ...]
2025-05-07T20:25:29.2442121Z     ------------------------------------------------------------
2025-05-07T20:25:29.2442463Z                                            Total:        1.63 GB
2025-05-07T20:25:29.2442807Z The following NEW packages will be INSTALLED:
[... the same ~120 packages, all from conda-forge, elided; they comprise the CUDA 12.6 toolkit, compiler, and library packages plus their X11/fontconfig dependencies ...]
2025-05-07T20:25:29.2526401Z The following packages will be UPDATED:
2025-05-07T20:25:29.2526877Z   libuuid  pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:29.2527477Z   zlib     pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:29.2528023Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:29.2528628Z   python   pkgs/main::python-3.10.16-he870216_1 --> conda-forge::python-3.10.13-hd12c33a_1_cpython
2025-05-07T20:25:29.2529250Z   sqlite   pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:29.2529818Z   tk       pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
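The [EXEC] [ATTEMPT 0/3] prefix above comes from a retry wrapper in .github/scripts/setup_env.bash; its exact implementation is not shown in this log, but a minimal sketch of the pattern (the helper name exec_with_retries and the backoff are assumptions, not the script's actual code) looks like:

    # Minimal sketch of a 3-attempt retry wrapper with linear backoff
    # (hypothetical; the real helper in setup_env.bash may differ).
    exec_with_retries () {
      local max=3 attempt=0
      while (( attempt < max )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0           # success: stop retrying
        attempt=$(( attempt + 1 ))
        sleep $(( attempt * 10 ))  # back off a little more each attempt
      done
      return 1                     # all attempts failed
    }

    # Usage, mirroring the install command in the log:
    #   exec_with_retries conda install --force-reinstall -n build_binary \
    #     -c conda-forge --override-channels -y cuda=12.6.3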
2025-05-07T20:25:29.2530331Z Downloading and Extracting Packages: ...working...
[... interleaved per-package download progress bars elided ...]
| 443.1 MB | ##1 | 22% 2025-05-07T20:25:31.9628236Z 2025-05-07T20:25:31.9628241Z 2025-05-07T20:25:31.9758354Z libcufft-11.3.0.4 | 156.2 MB | ######2 | 62%  2025-05-07T20:25:31.9759015Z 2025-05-07T20:25:31.9759022Z 2025-05-07T20:25:31.9759027Z 2025-05-07T20:25:31.9759033Z 2025-05-07T20:25:31.9937190Z cuda-nsight-12.6.77 | 113.2 MB | #######4 | 75%  2025-05-07T20:25:31.9937513Z 2025-05-07T20:25:32.0089551Z libcublas-12.6.4.1 | 256.2 MB | ###4 | 34%  2025-05-07T20:25:32.0090034Z 2025-05-07T20:25:32.0090040Z 2025-05-07T20:25:32.0090482Z 2025-05-07T20:25:32.0518908Z libcusparse-12.5.4.2 | 118.6 MB | ########2 | 83%  2025-05-07T20:25:32.0727189Z nsight-compute-2024. | 443.1 MB | ##2 | 23% 2025-05-07T20:25:32.0727506Z 2025-05-07T20:25:32.0727511Z 2025-05-07T20:25:32.0761157Z libcufft-11.3.0.4 | 156.2 MB | ######4 | 65%  2025-05-07T20:25:32.0761420Z 2025-05-07T20:25:32.0761424Z 2025-05-07T20:25:32.0761428Z 2025-05-07T20:25:32.0762110Z 2025-05-07T20:25:32.0937932Z cuda-nsight-12.6.77 | 113.2 MB | #######7 | 78%  2025-05-07T20:25:32.0938245Z 2025-05-07T20:25:32.1145385Z libcublas-12.6.4.1 | 256.2 MB | ###5 | 35%  2025-05-07T20:25:32.1145654Z 2025-05-07T20:25:32.1145658Z 2025-05-07T20:25:32.1145943Z 2025-05-07T20:25:32.1519463Z libcusparse-12.5.4.2 | 118.6 MB | ########5 | 86%  2025-05-07T20:25:32.1752712Z nsight-compute-2024. | 443.1 MB | ##3 | 24% 2025-05-07T20:25:32.1753102Z 2025-05-07T20:25:32.1754437Z 2025-05-07T20:25:32.1762417Z libcufft-11.3.0.4 | 156.2 MB | ######7 | 67%  2025-05-07T20:25:32.1762685Z 2025-05-07T20:25:32.1762689Z 2025-05-07T20:25:32.1762693Z 2025-05-07T20:25:32.1764033Z 2025-05-07T20:25:32.1940893Z cuda-nsight-12.6.77 | 113.2 MB | ########1 | 81%  2025-05-07T20:25:32.1941255Z 2025-05-07T20:25:32.2146324Z libcublas-12.6.4.1 | 256.2 MB | ###6 | 37%  2025-05-07T20:25:32.2146628Z 2025-05-07T20:25:32.2146639Z 2025-05-07T20:25:32.2147182Z 2025-05-07T20:25:32.2535193Z libcusparse-12.5.4.2 | 118.6 MB | ########8 | 89%  2025-05-07T20:25:32.2765382Z nsight-compute-2024. | 443.1 MB | ##4 | 24% 2025-05-07T20:25:32.2765649Z 2025-05-07T20:25:32.2765654Z 2025-05-07T20:25:32.2765658Z 2025-05-07T20:25:32.2765677Z 2025-05-07T20:25:32.2825257Z cuda-nsight-12.6.77 | 113.2 MB | ########4 | 84%  2025-05-07T20:25:32.2825572Z 2025-05-07T20:25:32.2827185Z 2025-05-07T20:25:32.3033340Z libcufft-11.3.0.4 | 156.2 MB | ######9 | 69%  2025-05-07T20:25:32.3033656Z 2025-05-07T20:25:32.3187052Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 38%  2025-05-07T20:25:32.3187450Z 2025-05-07T20:25:32.3187456Z 2025-05-07T20:25:32.3192402Z 2025-05-07T20:25:32.3599511Z libcusparse-12.5.4.2 | 118.6 MB | #########1 | 92%  2025-05-07T20:25:32.3766990Z nsight-compute-2024. | 443.1 MB | ##5 | 25% 2025-05-07T20:25:32.3767291Z 2025-05-07T20:25:32.3767297Z 2025-05-07T20:25:32.3767302Z 2025-05-07T20:25:32.3769664Z 2025-05-07T20:25:32.3847675Z cuda-nsight-12.6.77 | 113.2 MB | ########7 | 87%  2025-05-07T20:25:32.3847988Z 2025-05-07T20:25:32.3850579Z 2025-05-07T20:25:32.4033991Z libcufft-11.3.0.4 | 156.2 MB | #######1 | 72%  2025-05-07T20:25:32.4034281Z 2025-05-07T20:25:32.4187588Z libcublas-12.6.4.1 | 256.2 MB | ###9 | 39%  2025-05-07T20:25:32.4187857Z 2025-05-07T20:25:32.4187861Z 2025-05-07T20:25:32.4193330Z 2025-05-07T20:25:32.4768823Z libcusparse-12.5.4.2 | 118.6 MB | #########4 | 95%  2025-05-07T20:25:32.4769118Z 2025-05-07T20:25:32.4769132Z 2025-05-07T20:25:32.4769136Z 2025-05-07T20:25:32.4770488Z 2025-05-07T20:25:32.4791568Z cuda-nsight-12.6.77 | 113.2 MB | ######### | 91%  2025-05-07T20:25:32.4850457Z nsight-compute-2024. 
| 443.1 MB | ##6 | 26% 2025-05-07T20:25:32.4850828Z 2025-05-07T20:25:32.4853633Z 2025-05-07T20:25:32.5034562Z libcufft-11.3.0.4 | 156.2 MB | #######4 | 74%  2025-05-07T20:25:32.5034937Z 2025-05-07T20:25:32.5189066Z libcublas-12.6.4.1 | 256.2 MB | #### | 41%  2025-05-07T20:25:32.5189606Z 2025-05-07T20:25:32.5189610Z 2025-05-07T20:25:32.5191808Z 2025-05-07T20:25:32.5793263Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 98%  2025-05-07T20:25:32.5798107Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:32.5798352Z 2025-05-07T20:25:32.5798725Z 2025-05-07T20:25:32.5798729Z 2025-05-07T20:25:32.5798821Z 2025-05-07T20:25:32.5852636Z cuda-nsight-12.6.77 | 113.2 MB | #########3 | 94%  2025-05-07T20:25:32.5852922Z 2025-05-07T20:25:32.5854141Z 2025-05-07T20:25:32.6038332Z libcufft-11.3.0.4 | 156.2 MB | #######6 | 76%  2025-05-07T20:25:32.6038587Z 2025-05-07T20:25:32.6795088Z libcublas-12.6.4.1 | 256.2 MB | ####2 | 42%  2025-05-07T20:25:32.6854750Z nsight-compute-2024. | 443.1 MB | ##7 | 28% 2025-05-07T20:25:32.6855025Z 2025-05-07T20:25:32.6856467Z 2025-05-07T20:25:32.6892897Z libcufft-11.3.0.4 | 156.2 MB | #######8 | 79%  2025-05-07T20:25:32.6893167Z 2025-05-07T20:25:32.6893193Z 2025-05-07T20:25:32.6893196Z 2025-05-07T20:25:32.6893200Z 2025-05-07T20:25:32.7041202Z cuda-nsight-12.6.77 | 113.2 MB | #########6 | 97%  2025-05-07T20:25:32.7041633Z 2025-05-07T20:25:32.7798643Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 44%  2025-05-07T20:25:32.7858406Z nsight-compute-2024. | 443.1 MB | ##8 | 29% 2025-05-07T20:25:32.7858770Z 2025-05-07T20:25:32.7860232Z 2025-05-07T20:25:32.8042753Z libcufft-11.3.0.4 | 156.2 MB | ########1 | 81%  2025-05-07T20:25:32.8045541Z 2025-05-07T20:25:32.8805801Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 45%  2025-05-07T20:25:32.8862602Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:25:32.8862849Z 2025-05-07T20:25:32.8863152Z 2025-05-07T20:25:32.9044102Z libcufft-11.3.0.4 | 156.2 MB | ########4 | 84%  2025-05-07T20:25:32.9044661Z 2025-05-07T20:25:32.9808911Z libcublas-12.6.4.1 | 256.2 MB | ####6 | 47%  2025-05-07T20:25:32.9950301Z nsight-compute-2024. | 443.1 MB | ### | 31% 2025-05-07T20:25:32.9950587Z 2025-05-07T20:25:32.9951934Z 2025-05-07T20:25:33.0045299Z libcufft-11.3.0.4 | 156.2 MB | ########6 | 87%  2025-05-07T20:25:33.0048578Z 2025-05-07T20:25:33.0817305Z libcublas-12.6.4.1 | 256.2 MB | ####8 | 49%  2025-05-07T20:25:33.0950944Z nsight-compute-2024. | 443.1 MB | ###1 | 32% 2025-05-07T20:25:33.0951290Z 2025-05-07T20:25:33.0953153Z 2025-05-07T20:25:33.1048450Z libcufft-11.3.0.4 | 156.2 MB | ########9 | 89%  2025-05-07T20:25:33.1052538Z 2025-05-07T20:25:33.1819917Z libcublas-12.6.4.1 | 256.2 MB | ##### | 50%  2025-05-07T20:25:33.1953112Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:25:33.1953472Z 2025-05-07T20:25:33.1955627Z 2025-05-07T20:25:33.2053383Z libcufft-11.3.0.4 | 156.2 MB | #########2 | 92%  2025-05-07T20:25:33.2057449Z 2025-05-07T20:25:33.2820279Z libcublas-12.6.4.1 | 256.2 MB | #####1 | 52%  2025-05-07T20:25:33.2956248Z nsight-compute-2024. | 443.1 MB | ###3 | 34% 2025-05-07T20:25:33.2956629Z 2025-05-07T20:25:33.2958537Z 2025-05-07T20:25:33.3055607Z libcufft-11.3.0.4 | 156.2 MB | #########4 | 95%  2025-05-07T20:25:33.3057931Z 2025-05-07T20:25:33.3824324Z libcublas-12.6.4.1 | 256.2 MB | #####3 | 53%  2025-05-07T20:25:33.3958257Z nsight-compute-2024. 
| 443.1 MB | ###4 | 34% 2025-05-07T20:25:33.3958604Z 2025-05-07T20:25:33.3960365Z 2025-05-07T20:25:33.4057535Z libcufft-11.3.0.4 | 156.2 MB | #########7 | 97%  2025-05-07T20:25:33.4058633Z 2025-05-07T20:25:33.4828945Z libcublas-12.6.4.1 | 256.2 MB | #####4 | 55%  2025-05-07T20:25:33.5137899Z nsight-compute-2024. | 443.1 MB | ###5 | 35% 2025-05-07T20:25:33.5138276Z 2025-05-07T20:25:33.5832580Z libcublas-12.6.4.1 | 256.2 MB | #####6 | 57%  2025-05-07T20:25:33.6138442Z nsight-compute-2024. | 443.1 MB | ###6 | 36% 2025-05-07T20:25:33.6138794Z 2025-05-07T20:25:33.6832837Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 58%  2025-05-07T20:25:33.7140163Z nsight-compute-2024. | 443.1 MB | ###7 | 37% 2025-05-07T20:25:33.7140627Z 2025-05-07T20:25:33.7840599Z libcublas-12.6.4.1 | 256.2 MB | ###### | 60%  2025-05-07T20:25:33.8142229Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:33.8144538Z 2025-05-07T20:25:33.8842988Z libcublas-12.6.4.1 | 256.2 MB | ######2 | 62%  2025-05-07T20:25:33.9143219Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:33.9143812Z 2025-05-07T20:25:33.9849584Z libcublas-12.6.4.1 | 256.2 MB | ######4 | 64%  2025-05-07T20:25:34.0143536Z nsight-compute-2024. | 443.1 MB | #### | 40% 2025-05-07T20:25:34.0145239Z 2025-05-07T20:25:34.0853590Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 66%  2025-05-07T20:25:34.1181275Z nsight-compute-2024. | 443.1 MB | ####1 | 41% 2025-05-07T20:25:34.1181636Z 2025-05-07T20:25:34.1857429Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 68%  2025-05-07T20:25:34.2182199Z nsight-compute-2024. | 443.1 MB | ####2 | 43% 2025-05-07T20:25:34.2184517Z 2025-05-07T20:25:34.2858922Z libcublas-12.6.4.1 | 256.2 MB | ######9 | 69%  2025-05-07T20:25:34.3188310Z nsight-compute-2024. | 443.1 MB | ####3 | 44% 2025-05-07T20:25:34.3188754Z 2025-05-07T20:25:34.3861615Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 71%  2025-05-07T20:25:34.4192383Z nsight-compute-2024. | 443.1 MB | ####4 | 45% 2025-05-07T20:25:34.4195165Z 2025-05-07T20:25:34.4865501Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:25:34.5213327Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:25:34.5213676Z 2025-05-07T20:25:34.5955680Z libcublas-12.6.4.1 | 256.2 MB | #######4 | 75%  2025-05-07T20:25:34.6214464Z nsight-compute-2024. | 443.1 MB | ####7 | 47% 2025-05-07T20:25:34.6214818Z 2025-05-07T20:25:34.7216631Z libcublas-12.6.4.1 | 256.2 MB | #######6 | 77%  2025-05-07T20:25:34.7217022Z 2025-05-07T20:25:34.7807100Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:25:34.8219088Z nsight-compute-2024. | 443.1 MB | ####8 | 48% 2025-05-07T20:25:34.8219434Z 2025-05-07T20:25:34.8911308Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:25:34.9200199Z nsight-compute-2024. | 443.1 MB | ####9 | 49% 2025-05-07T20:25:34.9200546Z 2025-05-07T20:25:34.9200668Z 2025-05-07T20:25:34.9200675Z 2025-05-07T20:25:34.9200732Z 2025-05-07T20:25:34.9201234Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:34.9201605Z 2025-05-07T20:25:34.9201611Z 2025-05-07T20:25:34.9201635Z 2025-05-07T20:25:34.9201641Z 2025-05-07T20:25:34.9224202Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:34.9224580Z 2025-05-07T20:25:34.9899019Z libcublas-12.6.4.1 | 256.2 MB | ########3 | 83%  2025-05-07T20:25:34.9899379Z 2025-05-07T20:25:34.9899385Z 2025-05-07T20:25:34.9899423Z 2025-05-07T20:25:34.9899463Z 2025-05-07T20:25:34.9899500Z 2025-05-07T20:25:34.9912093Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:35.0610697Z nsight-compute-2024. 
| 443.1 MB | ####9 | 50% 2025-05-07T20:25:35.0614153Z 2025-05-07T20:25:35.0899066Z libcublas-12.6.4.1 | 256.2 MB | ########5 | 86%  2025-05-07T20:25:35.0899437Z 2025-05-07T20:25:35.0899714Z 2025-05-07T20:25:35.0899721Z 2025-05-07T20:25:35.0899725Z 2025-05-07T20:25:35.0899775Z 2025-05-07T20:25:35.0994465Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:25:35.1688111Z nsight-compute-2024. | 443.1 MB | ##### | 51% 2025-05-07T20:25:35.1688468Z 2025-05-07T20:25:35.1688479Z 2025-05-07T20:25:35.1692209Z 2025-05-07T20:25:35.1900616Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:35.1901008Z 2025-05-07T20:25:35.1901013Z 2025-05-07T20:25:35.1901019Z 2025-05-07T20:25:35.1901024Z 2025-05-07T20:25:35.1902673Z 2025-05-07T20:25:35.1994888Z cuda-nvvp-12.6.80 | 109.3 MB | 6 | 7%  2025-05-07T20:25:35.1995885Z 2025-05-07T20:25:35.2042480Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 87%  2025-05-07T20:25:35.2158622Z nsight-compute-2024. | 443.1 MB | #####1 | 52% 2025-05-07T20:25:35.2158993Z 2025-05-07T20:25:35.2158999Z 2025-05-07T20:25:35.2159004Z 2025-05-07T20:25:35.2159009Z 2025-05-07T20:25:35.2159014Z 2025-05-07T20:25:35.2161887Z 2025-05-07T20:25:35.2983303Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:35.2983715Z 2025-05-07T20:25:35.2983720Z 2025-05-07T20:25:35.2983725Z 2025-05-07T20:25:35.2983730Z 2025-05-07T20:25:35.2983734Z 2025-05-07T20:25:35.3165471Z cuda-nvvp-12.6.80 | 109.3 MB | 9 | 9%  2025-05-07T20:25:35.3165856Z 2025-05-07T20:25:35.3165861Z 2025-05-07T20:25:35.3165866Z 2025-05-07T20:25:35.3165871Z 2025-05-07T20:25:35.3165876Z 2025-05-07T20:25:35.3170113Z 2025-05-07T20:25:35.3408736Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:25:35.3484745Z nsight-compute-2024. | 443.1 MB | #####2 | 53% 2025-05-07T20:25:35.3485118Z 2025-05-07T20:25:35.3985714Z libcublas-12.6.4.1 | 256.2 MB | ########9 | 89%  2025-05-07T20:25:35.3986071Z 2025-05-07T20:25:35.3986077Z 2025-05-07T20:25:35.3986082Z 2025-05-07T20:25:35.3986096Z 2025-05-07T20:25:35.3987461Z 2025-05-07T20:25:35.4173963Z cuda-nvvp-12.6.80 | 109.3 MB | #2 | 12%  2025-05-07T20:25:35.4174340Z 2025-05-07T20:25:35.4174351Z 2025-05-07T20:25:35.4174356Z 2025-05-07T20:25:35.4174373Z 2025-05-07T20:25:35.4174378Z 2025-05-07T20:25:35.4176817Z 2025-05-07T20:25:35.4720260Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 5%  2025-05-07T20:25:35.4818358Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:25:35.4818718Z 2025-05-07T20:25:35.5107628Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:25:35.5107974Z 2025-05-07T20:25:35.5108005Z 2025-05-07T20:25:35.5108011Z 2025-05-07T20:25:35.5108016Z 2025-05-07T20:25:35.5108026Z 2025-05-07T20:25:35.5175944Z cuda-nvvp-12.6.80 | 109.3 MB | #4 | 15%  2025-05-07T20:25:35.5176330Z 2025-05-07T20:25:35.5176336Z 2025-05-07T20:25:35.5176342Z 2025-05-07T20:25:35.5176347Z 2025-05-07T20:25:35.5176352Z 2025-05-07T20:25:35.5179541Z 2025-05-07T20:25:35.5950859Z libcusolver-11.7.1.2 | 95.8 MB | 7 | 8%  2025-05-07T20:25:35.5953619Z 2025-05-07T20:25:35.5993403Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 92%  2025-05-07T20:25:35.6107837Z nsight-compute-2024. 
| 443.1 MB | #####4 | 54% 2025-05-07T20:25:35.6108203Z 2025-05-07T20:25:35.6108209Z 2025-05-07T20:25:35.6108214Z 2025-05-07T20:25:35.6108219Z 2025-05-07T20:25:35.6108224Z 2025-05-07T20:25:35.6177331Z cuda-nvvp-12.6.80 | 109.3 MB | #7 | 17%  2025-05-07T20:25:35.6177677Z 2025-05-07T20:25:35.6177681Z 2025-05-07T20:25:35.6177703Z 2025-05-07T20:25:35.6177708Z 2025-05-07T20:25:35.6177712Z 2025-05-07T20:25:35.6179360Z 2025-05-07T20:25:35.7084142Z libcusolver-11.7.1.2 | 95.8 MB | # | 11%  2025-05-07T20:25:35.7085460Z 2025-05-07T20:25:35.7126326Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 94%  2025-05-07T20:25:35.7163848Z nsight-compute-2024. | 443.1 MB | #####4 | 55% 2025-05-07T20:25:35.7164091Z 2025-05-07T20:25:35.7164095Z 2025-05-07T20:25:35.7164099Z 2025-05-07T20:25:35.7164103Z 2025-05-07T20:25:35.7168276Z 2025-05-07T20:25:35.7182192Z cuda-nvvp-12.6.80 | 109.3 MB | ## | 20%  2025-05-07T20:25:35.7182585Z 2025-05-07T20:25:35.7182590Z 2025-05-07T20:25:35.7182596Z 2025-05-07T20:25:35.7182601Z 2025-05-07T20:25:35.7182606Z 2025-05-07T20:25:35.7185316Z 2025-05-07T20:25:35.8167111Z libcusolver-11.7.1.2 | 95.8 MB | #3 | 14%  2025-05-07T20:25:35.8167412Z 2025-05-07T20:25:35.8167416Z 2025-05-07T20:25:35.8167420Z 2025-05-07T20:25:35.8167661Z 2025-05-07T20:25:35.8170838Z 2025-05-07T20:25:35.8185749Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 23%  2025-05-07T20:25:35.8186249Z 2025-05-07T20:25:35.8186255Z 2025-05-07T20:25:35.8186259Z 2025-05-07T20:25:35.8186262Z 2025-05-07T20:25:35.8186266Z 2025-05-07T20:25:35.8189030Z 2025-05-07T20:25:35.8214507Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 17%  2025-05-07T20:25:35.8272338Z nsight-compute-2024. | 443.1 MB | #####5 | 56% 2025-05-07T20:25:35.8274886Z 2025-05-07T20:25:35.9188499Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 95%  2025-05-07T20:25:35.9188850Z 2025-05-07T20:25:35.9188854Z 2025-05-07T20:25:35.9188858Z 2025-05-07T20:25:35.9188862Z 2025-05-07T20:25:35.9188866Z 2025-05-07T20:25:35.9190358Z 2025-05-07T20:25:35.9201254Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:35.9201665Z 2025-05-07T20:25:35.9201671Z 2025-05-07T20:25:35.9201676Z 2025-05-07T20:25:35.9201709Z 2025-05-07T20:25:35.9203306Z 2025-05-07T20:25:35.9303608Z cuda-nvvp-12.6.80 | 109.3 MB | ##5 | 25%  2025-05-07T20:25:35.9303919Z 2025-05-07T20:25:35.9311993Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 96%  2025-05-07T20:25:36.0194504Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:36.0194801Z 2025-05-07T20:25:36.0194807Z 2025-05-07T20:25:36.0194812Z 2025-05-07T20:25:36.0194817Z 2025-05-07T20:25:36.0194822Z 2025-05-07T20:25:36.0194827Z 2025-05-07T20:25:36.0366733Z libcusolver-11.7.1.2 | 95.8 MB | ##3 | 23%  2025-05-07T20:25:36.0408305Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:25:36.0408678Z 2025-05-07T20:25:36.0408684Z 2025-05-07T20:25:36.0408689Z 2025-05-07T20:25:36.0408694Z 2025-05-07T20:25:36.0408699Z 2025-05-07T20:25:36.0525162Z cuda-nvvp-12.6.80 | 109.3 MB | ##7 | 28%  2025-05-07T20:25:36.0530513Z 2025-05-07T20:25:36.1197298Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 97%  2025-05-07T20:25:36.1197716Z 2025-05-07T20:25:36.1197722Z 2025-05-07T20:25:36.1197727Z 2025-05-07T20:25:36.1197733Z 2025-05-07T20:25:36.1197750Z 2025-05-07T20:25:36.1197756Z 2025-05-07T20:25:36.1368627Z libcusolver-11.7.1.2 | 95.8 MB | ##6 | 26%  2025-05-07T20:25:36.1411862Z nsight-compute-2024. 
| 443.1 MB | #####7 | 57% 2025-05-07T20:25:36.1412223Z 2025-05-07T20:25:36.1412230Z 2025-05-07T20:25:36.1412235Z 2025-05-07T20:25:36.1412250Z 2025-05-07T20:25:36.1412255Z 2025-05-07T20:25:36.1620706Z cuda-nvvp-12.6.80 | 109.3 MB | ### | 30%  2025-05-07T20:25:36.1621088Z 2025-05-07T20:25:36.2226562Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 99%  2025-05-07T20:25:36.2226924Z 2025-05-07T20:25:36.2226930Z 2025-05-07T20:25:36.2226935Z 2025-05-07T20:25:36.2226940Z 2025-05-07T20:25:36.2226945Z 2025-05-07T20:25:36.2230977Z 2025-05-07T20:25:36.2415261Z libcusolver-11.7.1.2 | 95.8 MB | ##9 | 29%  2025-05-07T20:25:36.2415701Z 2025-05-07T20:25:36.2415707Z 2025-05-07T20:25:36.2415713Z 2025-05-07T20:25:36.2415718Z 2025-05-07T20:25:36.2420048Z 2025-05-07T20:25:36.2449986Z cuda-nvvp-12.6.80 | 109.3 MB | ###2 | 33%  2025-05-07T20:25:36.2633998Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:25:36.2634350Z 2025-05-07T20:25:36.3229767Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:36.3230146Z 2025-05-07T20:25:36.3230152Z 2025-05-07T20:25:36.3230158Z 2025-05-07T20:25:36.3230163Z 2025-05-07T20:25:36.3230168Z 2025-05-07T20:25:36.3230173Z 2025-05-07T20:25:36.3478473Z libcusolver-11.7.1.2 | 95.8 MB | ###2 | 33%  2025-05-07T20:25:36.3526277Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:25:36.3526712Z 2025-05-07T20:25:36.3526718Z 2025-05-07T20:25:36.3526723Z 2025-05-07T20:25:36.3526728Z 2025-05-07T20:25:36.3529967Z 2025-05-07T20:25:36.4329677Z cuda-nvvp-12.6.80 | 109.3 MB | ###5 | 35%  2025-05-07T20:25:36.4330365Z 2025-05-07T20:25:36.4330372Z 2025-05-07T20:25:36.4330377Z 2025-05-07T20:25:36.4330533Z 2025-05-07T20:25:36.4330539Z 2025-05-07T20:25:36.4330544Z 2025-05-07T20:25:36.4488296Z libcusolver-11.7.1.2 | 95.8 MB | ###5 | 36%  2025-05-07T20:25:36.4570302Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:36.4570659Z 2025-05-07T20:25:36.4570666Z 2025-05-07T20:25:36.4570671Z 2025-05-07T20:25:36.4570676Z 2025-05-07T20:25:36.4575956Z 2025-05-07T20:25:36.5331418Z cuda-nvvp-12.6.80 | 109.3 MB | ###7 | 38%  2025-05-07T20:25:36.5331703Z 2025-05-07T20:25:36.5331707Z 2025-05-07T20:25:36.5331711Z 2025-05-07T20:25:36.5331715Z 2025-05-07T20:25:36.5331730Z 2025-05-07T20:25:36.5332891Z 2025-05-07T20:25:36.5493021Z libcusolver-11.7.1.2 | 95.8 MB | ###9 | 39%  2025-05-07T20:25:36.5643382Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:36.5643656Z 2025-05-07T20:25:36.5643661Z 2025-05-07T20:25:36.5643665Z 2025-05-07T20:25:36.5643668Z 2025-05-07T20:25:36.5647210Z 2025-05-07T20:25:36.6378210Z cuda-nvvp-12.6.80 | 109.3 MB | #### | 40%  2025-05-07T20:25:36.6386202Z 2025-05-07T20:25:36.6386207Z 2025-05-07T20:25:36.6386221Z 2025-05-07T20:25:36.6386225Z 2025-05-07T20:25:36.6386229Z 2025-05-07T20:25:36.6386232Z 2025-05-07T20:25:36.6495149Z libcusolver-11.7.1.2 | 95.8 MB | ####2 | 42%  2025-05-07T20:25:36.6645003Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:25:36.6645347Z 2025-05-07T20:25:36.6645353Z 2025-05-07T20:25:36.6645358Z 2025-05-07T20:25:36.6645363Z 2025-05-07T20:25:36.6646772Z 2025-05-07T20:25:36.7380327Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 43%  2025-05-07T20:25:36.7380718Z 2025-05-07T20:25:36.7380723Z 2025-05-07T20:25:36.7380729Z 2025-05-07T20:25:36.7380734Z 2025-05-07T20:25:36.7380739Z 2025-05-07T20:25:36.7380834Z 2025-05-07T20:25:36.7499665Z libcusolver-11.7.1.2 | 95.8 MB | ####5 | 46%  2025-05-07T20:25:36.7650789Z nsight-compute-2024. 
| 443.1 MB | ######1 | 61% 2025-05-07T20:25:36.7651152Z 2025-05-07T20:25:36.7651158Z 2025-05-07T20:25:36.7651164Z 2025-05-07T20:25:36.7651169Z 2025-05-07T20:25:36.7652923Z 2025-05-07T20:25:36.8380780Z cuda-nvvp-12.6.80 | 109.3 MB | ####5 | 46%  2025-05-07T20:25:36.8381142Z 2025-05-07T20:25:36.8381147Z 2025-05-07T20:25:36.8381151Z 2025-05-07T20:25:36.8381155Z 2025-05-07T20:25:36.8381158Z 2025-05-07T20:25:36.8385879Z 2025-05-07T20:25:36.8504838Z libcusolver-11.7.1.2 | 95.8 MB | ####9 | 49%  2025-05-07T20:25:36.8655770Z nsight-compute-2024. | 443.1 MB | ######2 | 62% 2025-05-07T20:25:36.8656164Z 2025-05-07T20:25:36.8656172Z 2025-05-07T20:25:36.8656179Z 2025-05-07T20:25:36.8656185Z 2025-05-07T20:25:36.8662418Z 2025-05-07T20:25:36.9390652Z cuda-nvvp-12.6.80 | 109.3 MB | ####8 | 49%  2025-05-07T20:25:36.9391016Z 2025-05-07T20:25:36.9391021Z 2025-05-07T20:25:36.9391025Z 2025-05-07T20:25:36.9391040Z 2025-05-07T20:25:36.9391044Z 2025-05-07T20:25:36.9391047Z 2025-05-07T20:25:36.9515456Z libcusolver-11.7.1.2 | 95.8 MB | #####2 | 52%  2025-05-07T20:25:36.9656797Z nsight-compute-2024. | 443.1 MB | ######2 | 63% 2025-05-07T20:25:36.9657094Z 2025-05-07T20:25:36.9657167Z 2025-05-07T20:25:36.9657173Z 2025-05-07T20:25:36.9657178Z 2025-05-07T20:25:36.9658487Z 2025-05-07T20:25:37.0399823Z cuda-nvvp-12.6.80 | 109.3 MB | #####1 | 51%  2025-05-07T20:25:37.0400131Z 2025-05-07T20:25:37.0400135Z 2025-05-07T20:25:37.0400139Z 2025-05-07T20:25:37.0400143Z 2025-05-07T20:25:37.0400147Z 2025-05-07T20:25:37.0400151Z 2025-05-07T20:25:37.0517238Z libcusolver-11.7.1.2 | 95.8 MB | #####5 | 56%  2025-05-07T20:25:37.0674856Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:37.0675446Z 2025-05-07T20:25:37.0675451Z 2025-05-07T20:25:37.0675455Z 2025-05-07T20:25:37.0675459Z 2025-05-07T20:25:37.0678289Z 2025-05-07T20:25:37.1413293Z cuda-nvvp-12.6.80 | 109.3 MB | #####4 | 54%  2025-05-07T20:25:37.1413683Z 2025-05-07T20:25:37.1413687Z 2025-05-07T20:25:37.1413691Z 2025-05-07T20:25:37.1413708Z 2025-05-07T20:25:37.1413712Z 2025-05-07T20:25:37.1416294Z 2025-05-07T20:25:37.1518862Z libcusolver-11.7.1.2 | 95.8 MB | #####9 | 59%  2025-05-07T20:25:37.1797544Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:25:37.1797791Z 2025-05-07T20:25:37.1797886Z 2025-05-07T20:25:37.1797893Z 2025-05-07T20:25:37.1797897Z 2025-05-07T20:25:37.1802364Z 2025-05-07T20:25:37.2419908Z cuda-nvvp-12.6.80 | 109.3 MB | #####6 | 57%  2025-05-07T20:25:37.2420212Z 2025-05-07T20:25:37.2420215Z 2025-05-07T20:25:37.2420220Z 2025-05-07T20:25:37.2420223Z 2025-05-07T20:25:37.2420227Z 2025-05-07T20:25:37.2420249Z 2025-05-07T20:25:37.2521791Z libcusolver-11.7.1.2 | 95.8 MB | ######2 | 63%  2025-05-07T20:25:37.2847563Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:25:37.2847828Z 2025-05-07T20:25:37.2847832Z 2025-05-07T20:25:37.2847836Z 2025-05-07T20:25:37.2847839Z 2025-05-07T20:25:37.2851323Z 2025-05-07T20:25:37.3577002Z cuda-nvvp-12.6.80 | 109.3 MB | #####9 | 59%  2025-05-07T20:25:37.3583969Z nsight-compute-2024. 
| 443.1 MB | ######5 | 66% 2025-05-07T20:25:37.3584223Z 2025-05-07T20:25:37.3584228Z 2025-05-07T20:25:37.3584242Z 2025-05-07T20:25:37.3584245Z 2025-05-07T20:25:37.3584249Z 2025-05-07T20:25:37.3584352Z 2025-05-07T20:25:37.3924039Z libcusolver-11.7.1.2 | 95.8 MB | ######5 | 66%  2025-05-07T20:25:37.3924343Z 2025-05-07T20:25:37.3924347Z 2025-05-07T20:25:37.3924351Z 2025-05-07T20:25:37.3924355Z 2025-05-07T20:25:37.3924358Z 2025-05-07T20:25:37.4583875Z cuda-nvvp-12.6.80 | 109.3 MB | ######1 | 62%  2025-05-07T20:25:37.4690896Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:25:37.4691150Z 2025-05-07T20:25:37.4691167Z 2025-05-07T20:25:37.4691172Z 2025-05-07T20:25:37.4691176Z 2025-05-07T20:25:37.4691180Z 2025-05-07T20:25:37.4693441Z 2025-05-07T20:25:37.4926915Z libcusolver-11.7.1.2 | 95.8 MB | ######9 | 69%  2025-05-07T20:25:37.4927251Z 2025-05-07T20:25:37.4927254Z 2025-05-07T20:25:37.4927258Z 2025-05-07T20:25:37.4927262Z 2025-05-07T20:25:37.4932335Z 2025-05-07T20:25:37.5687003Z cuda-nvvp-12.6.80 | 109.3 MB | ######4 | 64%  2025-05-07T20:25:37.5870924Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:25:37.5871264Z 2025-05-07T20:25:37.5871270Z 2025-05-07T20:25:37.5871275Z 2025-05-07T20:25:37.5871280Z 2025-05-07T20:25:37.5871285Z 2025-05-07T20:25:37.5871290Z 2025-05-07T20:25:37.5936328Z libcusolver-11.7.1.2 | 95.8 MB | #######2 | 72%  2025-05-07T20:25:37.5936638Z 2025-05-07T20:25:37.5936642Z 2025-05-07T20:25:37.5936646Z 2025-05-07T20:25:37.5936649Z 2025-05-07T20:25:37.5943852Z 2025-05-07T20:25:37.6732709Z cuda-nvvp-12.6.80 | 109.3 MB | ######6 | 67%  2025-05-07T20:25:37.6874316Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:25:37.6874599Z 2025-05-07T20:25:37.6874605Z 2025-05-07T20:25:37.6874610Z 2025-05-07T20:25:37.6874615Z 2025-05-07T20:25:37.6874620Z 2025-05-07T20:25:37.6879777Z 2025-05-07T20:25:37.6940825Z libcusolver-11.7.1.2 | 95.8 MB | #######5 | 75%  2025-05-07T20:25:37.6941116Z 2025-05-07T20:25:37.6941120Z 2025-05-07T20:25:37.6941124Z 2025-05-07T20:25:37.6941128Z 2025-05-07T20:25:37.6941132Z 2025-05-07T20:25:37.7735373Z cuda-nvvp-12.6.80 | 109.3 MB | ######9 | 69%  2025-05-07T20:25:37.7880550Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:25:37.7880811Z 2025-05-07T20:25:37.7880815Z 2025-05-07T20:25:37.7880818Z 2025-05-07T20:25:37.7881097Z 2025-05-07T20:25:37.7881101Z 2025-05-07T20:25:37.7881546Z 2025-05-07T20:25:37.7944470Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 79%  2025-05-07T20:25:37.7944784Z 2025-05-07T20:25:37.7944790Z 2025-05-07T20:25:37.7944795Z 2025-05-07T20:25:37.7944799Z 2025-05-07T20:25:37.7946850Z 2025-05-07T20:25:37.8326490Z cuda-nvvp-12.6.80 | 109.3 MB | #######2 | 72%  2025-05-07T20:25:37.8326871Z 2025-05-07T20:25:37.8329957Z 2025-05-07T20:25:37.8840417Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:37.8866268Z nsight-compute-2024. 
| 443.1 MB | ######9 | 69% 2025-05-07T20:25:37.8866632Z 2025-05-07T20:25:37.8866638Z 2025-05-07T20:25:37.8866643Z 2025-05-07T20:25:37.8866648Z 2025-05-07T20:25:37.8866653Z 2025-05-07T20:25:37.8866658Z 2025-05-07T20:25:37.8870242Z 2025-05-07T20:25:37.8952716Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:37.8953111Z 2025-05-07T20:25:37.8953134Z 2025-05-07T20:25:37.8953140Z 2025-05-07T20:25:37.8953145Z 2025-05-07T20:25:37.8953151Z 2025-05-07T20:25:37.8953156Z 2025-05-07T20:25:37.9045942Z libcusolver-11.7.1.2 | 95.8 MB | ########1 | 82%  2025-05-07T20:25:37.9046403Z 2025-05-07T20:25:37.9046409Z 2025-05-07T20:25:37.9046414Z 2025-05-07T20:25:37.9046419Z 2025-05-07T20:25:37.9048781Z 2025-05-07T20:25:37.9868448Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 75%  2025-05-07T20:25:37.9868840Z 2025-05-07T20:25:37.9868846Z 2025-05-07T20:25:37.9868851Z 2025-05-07T20:25:37.9868856Z 2025-05-07T20:25:37.9868861Z 2025-05-07T20:25:37.9868866Z 2025-05-07T20:25:37.9872799Z 2025-05-07T20:25:38.0075174Z libnpp-12.3.1.54 | 93.4 MB | 3 | 3%  2025-05-07T20:25:38.0091454Z nsight-compute-2024. | 443.1 MB | ######9 | 70% 2025-05-07T20:25:38.0091814Z 2025-05-07T20:25:38.0091821Z 2025-05-07T20:25:38.0091826Z 2025-05-07T20:25:38.0091832Z 2025-05-07T20:25:38.0091850Z 2025-05-07T20:25:38.0145835Z cuda-nvvp-12.6.80 | 109.3 MB | #######7 | 77%  2025-05-07T20:25:38.0146213Z 2025-05-07T20:25:38.0146232Z 2025-05-07T20:25:38.0146237Z 2025-05-07T20:25:38.0146242Z 2025-05-07T20:25:38.0146247Z 2025-05-07T20:25:38.0147812Z 2025-05-07T20:25:38.0869196Z libcusolver-11.7.1.2 | 95.8 MB | ########4 | 85%  2025-05-07T20:25:38.0869598Z 2025-05-07T20:25:38.0869603Z 2025-05-07T20:25:38.0869608Z 2025-05-07T20:25:38.0869613Z 2025-05-07T20:25:38.0869618Z 2025-05-07T20:25:38.0869623Z 2025-05-07T20:25:38.0871280Z 2025-05-07T20:25:38.1175727Z libnpp-12.3.1.54 | 93.4 MB | 5 | 6%  2025-05-07T20:25:38.1300880Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:38.1301236Z 2025-05-07T20:25:38.1301242Z 2025-05-07T20:25:38.1301247Z 2025-05-07T20:25:38.1301252Z 2025-05-07T20:25:38.1301257Z 2025-05-07T20:25:38.1452915Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 80%  2025-05-07T20:25:38.1453310Z 2025-05-07T20:25:38.1453316Z 2025-05-07T20:25:38.1453321Z 2025-05-07T20:25:38.1453326Z 2025-05-07T20:25:38.1453331Z 2025-05-07T20:25:38.1453346Z 2025-05-07T20:25:38.1873261Z libcusolver-11.7.1.2 | 95.8 MB | ########7 | 88%  2025-05-07T20:25:38.1873668Z 2025-05-07T20:25:38.1873674Z 2025-05-07T20:25:38.1873679Z 2025-05-07T20:25:38.1873684Z 2025-05-07T20:25:38.1873689Z 2025-05-07T20:25:38.1873694Z 2025-05-07T20:25:38.1876882Z 2025-05-07T20:25:38.2226750Z libnpp-12.3.1.54 | 93.4 MB | 8 | 8%  2025-05-07T20:25:38.2304666Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:25:38.2305020Z 2025-05-07T20:25:38.2305025Z 2025-05-07T20:25:38.2305031Z 2025-05-07T20:25:38.2305036Z 2025-05-07T20:25:38.2305041Z 2025-05-07T20:25:38.2635540Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 82%  2025-05-07T20:25:38.2635929Z 2025-05-07T20:25:38.2635934Z 2025-05-07T20:25:38.2636176Z 2025-05-07T20:25:38.2636181Z 2025-05-07T20:25:38.2636187Z 2025-05-07T20:25:38.2636192Z 2025-05-07T20:25:38.2882455Z libcusolver-11.7.1.2 | 95.8 MB | ######### | 90%  2025-05-07T20:25:38.2882878Z 2025-05-07T20:25:38.2882883Z 2025-05-07T20:25:38.2882888Z 2025-05-07T20:25:38.2882894Z 2025-05-07T20:25:38.2882899Z 2025-05-07T20:25:38.2882904Z 2025-05-07T20:25:38.2887258Z 2025-05-07T20:25:38.3290559Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:25:38.3309314Z nsight-compute-2024. 
| 443.1 MB | #######1 | 71% 2025-05-07T20:25:38.3309671Z 2025-05-07T20:25:38.3309677Z 2025-05-07T20:25:38.3309682Z 2025-05-07T20:25:38.3309687Z 2025-05-07T20:25:38.3313715Z 2025-05-07T20:25:38.3716115Z cuda-nvvp-12.6.80 | 109.3 MB | ########4 | 85%  2025-05-07T20:25:38.3716505Z 2025-05-07T20:25:38.3716510Z 2025-05-07T20:25:38.3716515Z 2025-05-07T20:25:38.3716520Z 2025-05-07T20:25:38.3716525Z 2025-05-07T20:25:38.3716530Z 2025-05-07T20:25:38.3890351Z libcusolver-11.7.1.2 | 95.8 MB | #########2 | 93%  2025-05-07T20:25:38.3890727Z 2025-05-07T20:25:38.3890731Z 2025-05-07T20:25:38.3890747Z 2025-05-07T20:25:38.3890751Z 2025-05-07T20:25:38.3890755Z 2025-05-07T20:25:38.3890758Z 2025-05-07T20:25:38.3890851Z 2025-05-07T20:25:38.4290690Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:25:38.4372678Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:38.4373035Z 2025-05-07T20:25:38.4373041Z 2025-05-07T20:25:38.4373046Z 2025-05-07T20:25:38.4373051Z 2025-05-07T20:25:38.4377291Z 2025-05-07T20:25:38.4833277Z cuda-nvvp-12.6.80 | 109.3 MB | ########6 | 87%  2025-05-07T20:25:38.4833657Z 2025-05-07T20:25:38.4833661Z 2025-05-07T20:25:38.4833664Z 2025-05-07T20:25:38.4833668Z 2025-05-07T20:25:38.4833672Z 2025-05-07T20:25:38.4833676Z 2025-05-07T20:25:38.4899363Z libcusolver-11.7.1.2 | 95.8 MB | #########5 | 95%  2025-05-07T20:25:38.4899717Z 2025-05-07T20:25:38.4899722Z 2025-05-07T20:25:38.4899725Z 2025-05-07T20:25:38.4899729Z 2025-05-07T20:25:38.4899733Z 2025-05-07T20:25:38.4899745Z 2025-05-07T20:25:38.4899749Z 2025-05-07T20:25:38.5299980Z libnpp-12.3.1.54 | 93.4 MB | #6 | 17%  2025-05-07T20:25:38.5449731Z nsight-compute-2024. | 443.1 MB | #######2 | 73% 2025-05-07T20:25:38.5450082Z 2025-05-07T20:25:38.5450087Z 2025-05-07T20:25:38.5450091Z 2025-05-07T20:25:38.5450095Z 2025-05-07T20:25:38.5452035Z 2025-05-07T20:25:38.5869639Z cuda-nvvp-12.6.80 | 109.3 MB | ########9 | 89%  2025-05-07T20:25:38.5870011Z 2025-05-07T20:25:38.5870015Z 2025-05-07T20:25:38.5870019Z 2025-05-07T20:25:38.5870023Z 2025-05-07T20:25:38.5870027Z 2025-05-07T20:25:38.5870031Z 2025-05-07T20:25:38.5939973Z libcusolver-11.7.1.2 | 95.8 MB | #########7 | 98%  2025-05-07T20:25:38.5940290Z 2025-05-07T20:25:38.5940294Z 2025-05-07T20:25:38.5940298Z 2025-05-07T20:25:38.5940312Z 2025-05-07T20:25:38.5940320Z 2025-05-07T20:25:38.5940325Z 2025-05-07T20:25:38.5940330Z 2025-05-07T20:25:38.6341635Z libnpp-12.3.1.54 | 93.4 MB | #9 | 20%  2025-05-07T20:25:38.6413203Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:38.6413565Z 2025-05-07T20:25:38.6413572Z 2025-05-07T20:25:38.6413577Z 2025-05-07T20:25:38.6413583Z 2025-05-07T20:25:38.6472882Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:38.6473155Z 2025-05-07T20:25:38.6473159Z 2025-05-07T20:25:38.6473169Z 2025-05-07T20:25:38.6473173Z 2025-05-07T20:25:38.6476492Z 2025-05-07T20:25:38.6940685Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 92%  2025-05-07T20:25:38.6940954Z 2025-05-07T20:25:38.6940966Z 2025-05-07T20:25:38.6940970Z 2025-05-07T20:25:38.6940973Z 2025-05-07T20:25:38.6940977Z 2025-05-07T20:25:38.6940981Z 2025-05-07T20:25:38.6946390Z 2025-05-07T20:25:38.7341896Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 23%  2025-05-07T20:25:38.7473370Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:25:38.7473614Z 2025-05-07T20:25:38.7473784Z 2025-05-07T20:25:38.7473789Z 2025-05-07T20:25:38.7473793Z 2025-05-07T20:25:38.7475249Z 2025-05-07T20:25:38.7949925Z cuda-nvvp-12.6.80 | 109.3 MB | #########4 | 95%  2025-05-07T20:25:38.7950194Z 2025-05-07T20:25:38.7950198Z 2025-05-07T20:25:38.7950201Z 2025-05-07T20:25:38.7950213Z 2025-05-07T20:25:38.7950217Z 2025-05-07T20:25:38.7950221Z 2025-05-07T20:25:38.7951996Z 2025-05-07T20:25:38.8346270Z libnpp-12.3.1.54 | 93.4 MB | ##5 | 26%  2025-05-07T20:25:38.8474901Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:38.8475147Z 2025-05-07T20:25:38.8475151Z 2025-05-07T20:25:38.8475155Z 2025-05-07T20:25:38.8475165Z 2025-05-07T20:25:38.8476681Z 2025-05-07T20:25:38.8950298Z cuda-nvvp-12.6.80 | 109.3 MB | #########8 | 98%  2025-05-07T20:25:38.8950588Z 2025-05-07T20:25:38.8950592Z 2025-05-07T20:25:38.8950603Z 2025-05-07T20:25:38.8950607Z 2025-05-07T20:25:38.8950611Z 2025-05-07T20:25:38.8950620Z 2025-05-07T20:25:38.8955426Z 2025-05-07T20:25:38.9346983Z libnpp-12.3.1.54 | 93.4 MB | ##9 | 29%  2025-05-07T20:25:38.9952321Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:25:38.9952699Z 2025-05-07T20:25:38.9952705Z 2025-05-07T20:25:38.9952710Z 2025-05-07T20:25:38.9952716Z 2025-05-07T20:25:38.9952720Z 2025-05-07T20:25:38.9952726Z 2025-05-07T20:25:38.9952730Z 2025-05-07T20:25:39.0352117Z libnpp-12.3.1.54 | 93.4 MB | ###2 | 33%  2025-05-07T20:25:39.0956382Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:39.0956737Z 2025-05-07T20:25:39.0956742Z 2025-05-07T20:25:39.0956746Z 2025-05-07T20:25:39.0956749Z 2025-05-07T20:25:39.0956753Z 2025-05-07T20:25:39.0956757Z 2025-05-07T20:25:39.0958781Z 2025-05-07T20:25:39.1353323Z libnpp-12.3.1.54 | 93.4 MB | ###6 | 36%  2025-05-07T20:25:39.1962515Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:25:39.1962789Z 2025-05-07T20:25:39.1962793Z 2025-05-07T20:25:39.1962797Z 2025-05-07T20:25:39.1962801Z 2025-05-07T20:25:39.1962805Z 2025-05-07T20:25:39.1962815Z 2025-05-07T20:25:39.1965751Z 2025-05-07T20:25:39.2354731Z libnpp-12.3.1.54 | 93.4 MB | ###9 | 40%  2025-05-07T20:25:39.2968064Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:25:39.2968419Z 2025-05-07T20:25:39.2968424Z 2025-05-07T20:25:39.2968427Z 2025-05-07T20:25:39.2968431Z 2025-05-07T20:25:39.2968434Z 2025-05-07T20:25:39.2968438Z 2025-05-07T20:25:39.2968441Z 2025-05-07T20:25:39.3357704Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 43%  2025-05-07T20:25:39.3968263Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:25:39.3968619Z 2025-05-07T20:25:39.3968626Z 2025-05-07T20:25:39.3968646Z 2025-05-07T20:25:39.3968652Z 2025-05-07T20:25:39.3968796Z 2025-05-07T20:25:39.3968801Z 2025-05-07T20:25:39.3970232Z 2025-05-07T20:25:39.4358778Z libnpp-12.3.1.54 | 93.4 MB | ####6 | 47%  2025-05-07T20:25:39.4974420Z nsight-compute-2024. | 443.1 MB | #######9 | 80% 2025-05-07T20:25:39.4974804Z 2025-05-07T20:25:39.4974810Z 2025-05-07T20:25:39.4974816Z 2025-05-07T20:25:39.4974824Z 2025-05-07T20:25:39.4974966Z 2025-05-07T20:25:39.4974973Z 2025-05-07T20:25:39.4976584Z 2025-05-07T20:25:39.5359019Z libnpp-12.3.1.54 | 93.4 MB | ##### | 51%  2025-05-07T20:25:39.5997286Z nsight-compute-2024. 
| 443.1 MB | ######## | 80% 2025-05-07T20:25:39.5997648Z 2025-05-07T20:25:39.5997652Z 2025-05-07T20:25:39.5997656Z 2025-05-07T20:25:39.5997660Z 2025-05-07T20:25:39.5997663Z 2025-05-07T20:25:39.5997667Z 2025-05-07T20:25:39.6000003Z 2025-05-07T20:25:39.6380077Z libnpp-12.3.1.54 | 93.4 MB | #####4 | 55%  2025-05-07T20:25:39.7046325Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:25:39.7046609Z 2025-05-07T20:25:39.7046824Z 2025-05-07T20:25:39.7046832Z 2025-05-07T20:25:39.7046837Z 2025-05-07T20:25:39.7046842Z 2025-05-07T20:25:39.7046847Z 2025-05-07T20:25:39.7049248Z 2025-05-07T20:25:39.7382894Z libnpp-12.3.1.54 | 93.4 MB | #####8 | 58%  2025-05-07T20:25:39.8046446Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:39.8046789Z 2025-05-07T20:25:39.8046793Z 2025-05-07T20:25:39.8046796Z 2025-05-07T20:25:39.8046800Z 2025-05-07T20:25:39.8046804Z 2025-05-07T20:25:39.8046808Z 2025-05-07T20:25:39.8048434Z 2025-05-07T20:25:39.8435065Z libnpp-12.3.1.54 | 93.4 MB | ######2 | 62%  2025-05-07T20:25:39.9048099Z nsight-compute-2024. | 443.1 MB | ########2 | 83% 2025-05-07T20:25:39.9048410Z 2025-05-07T20:25:39.9048415Z 2025-05-07T20:25:39.9048419Z 2025-05-07T20:25:39.9048422Z 2025-05-07T20:25:39.9048444Z 2025-05-07T20:25:39.9048448Z 2025-05-07T20:25:39.9049695Z 2025-05-07T20:25:40.0054381Z libnpp-12.3.1.54 | 93.4 MB | ######6 | 66%  2025-05-07T20:25:40.0054671Z 2025-05-07T20:25:40.0054677Z 2025-05-07T20:25:40.0054681Z 2025-05-07T20:25:40.0054685Z 2025-05-07T20:25:40.0054689Z 2025-05-07T20:25:40.0054701Z 2025-05-07T20:25:40.0054708Z 2025-05-07T20:25:40.0985853Z libnpp-12.3.1.54 | 93.4 MB | ####### | 71%  2025-05-07T20:25:40.1985694Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:25:40.2746789Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:40.2747059Z 2025-05-07T20:25:40.2747130Z 2025-05-07T20:25:40.2747137Z 2025-05-07T20:25:40.2747143Z 2025-05-07T20:25:40.2747148Z 2025-05-07T20:25:40.2747153Z 2025-05-07T20:25:40.2749671Z 2025-05-07T20:25:40.2987429Z libnpp-12.3.1.54 | 93.4 MB | #######4 | 75%  2025-05-07T20:25:40.3747425Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:25:40.3747698Z 2025-05-07T20:25:40.3747779Z 2025-05-07T20:25:40.3747783Z 2025-05-07T20:25:40.3747904Z 2025-05-07T20:25:40.3747930Z 2025-05-07T20:25:40.3747936Z 2025-05-07T20:25:40.3750635Z 2025-05-07T20:25:40.3988777Z libnpp-12.3.1.54 | 93.4 MB | #######7 | 78%  2025-05-07T20:25:40.4747610Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:25:40.4747875Z 2025-05-07T20:25:40.4747879Z 2025-05-07T20:25:40.4747883Z 2025-05-07T20:25:40.4747887Z 2025-05-07T20:25:40.4747891Z 2025-05-07T20:25:40.4747894Z 2025-05-07T20:25:40.4748010Z 2025-05-07T20:25:40.5001783Z libnpp-12.3.1.54 | 93.4 MB | ########2 | 82%  2025-05-07T20:25:40.5749520Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:40.5749779Z 2025-05-07T20:25:40.5749895Z 2025-05-07T20:25:40.5749900Z 2025-05-07T20:25:40.5750024Z 2025-05-07T20:25:40.5750031Z 2025-05-07T20:25:40.5750037Z 2025-05-07T20:25:40.5750057Z 2025-05-07T20:25:40.6006636Z libnpp-12.3.1.54 | 93.4 MB | ########5 | 86%  2025-05-07T20:25:40.6800364Z nsight-compute-2024. | 443.1 MB | ########7 | 88% 2025-05-07T20:25:40.6800635Z 2025-05-07T20:25:40.6800639Z 2025-05-07T20:25:40.6800643Z 2025-05-07T20:25:40.6800647Z 2025-05-07T20:25:40.6800651Z 2025-05-07T20:25:40.6800654Z 2025-05-07T20:25:40.6800658Z 2025-05-07T20:25:40.7011610Z libnpp-12.3.1.54 | 93.4 MB | ########9 | 90%  2025-05-07T20:25:40.7802440Z nsight-compute-2024. 
| 443.1 MB | ########8 | 89% 2025-05-07T20:25:40.7802708Z 2025-05-07T20:25:40.7802712Z 2025-05-07T20:25:40.7802717Z 2025-05-07T20:25:40.7802720Z 2025-05-07T20:25:40.7802724Z 2025-05-07T20:25:40.7802728Z 2025-05-07T20:25:40.7802866Z 2025-05-07T20:25:40.8018666Z libnpp-12.3.1.54 | 93.4 MB | #########3 | 93%  2025-05-07T20:25:40.8807613Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:40.8808234Z 2025-05-07T20:25:40.8808240Z 2025-05-07T20:25:40.8808246Z 2025-05-07T20:25:40.8808251Z 2025-05-07T20:25:40.8808257Z 2025-05-07T20:25:40.8808261Z 2025-05-07T20:25:40.8812355Z 2025-05-07T20:25:40.9585658Z libnpp-12.3.1.54 | 93.4 MB | #########7 | 97%  2025-05-07T20:25:41.0737865Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:25:41.1745232Z nsight-compute-2024. | 443.1 MB | #########1 | 91% 2025-05-07T20:25:41.2748987Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:25:41.3749462Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:25:41.4695320Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:41.4695577Z 2025-05-07T20:25:41.4695906Z 2025-05-07T20:25:41.4695910Z 2025-05-07T20:25:41.4695922Z 2025-05-07T20:25:41.4695959Z 2025-05-07T20:25:41.4695963Z 2025-05-07T20:25:41.4749773Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:41.5053977Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:41.5054297Z 2025-05-07T20:25:41.5054494Z 2025-05-07T20:25:41.5054500Z 2025-05-07T20:25:41.5054514Z 2025-05-07T20:25:41.5054519Z 2025-05-07T20:25:41.5054524Z 2025-05-07T20:25:41.5054529Z 2025-05-07T20:25:41.5055703Z 2025-05-07T20:25:41.6041821Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:41.6069508Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:25:41.6069766Z 2025-05-07T20:25:41.6069972Z 2025-05-07T20:25:41.6069977Z 2025-05-07T20:25:41.6069996Z 2025-05-07T20:25:41.6070000Z 2025-05-07T20:25:41.6070004Z 2025-05-07T20:25:41.6070008Z 2025-05-07T20:25:41.6074174Z 2025-05-07T20:25:41.7201608Z cuda-nvdisasm-12.6.7 | 47.6 MB | 7 | 7%  2025-05-07T20:25:41.7201923Z 2025-05-07T20:25:41.7201927Z 2025-05-07T20:25:41.7201931Z 2025-05-07T20:25:41.7201935Z 2025-05-07T20:25:41.7201939Z 2025-05-07T20:25:41.7201943Z 2025-05-07T20:25:41.7201955Z 2025-05-07T20:25:41.7204133Z 2025-05-07T20:25:41.7296318Z cuda-nvdisasm-12.6.7 | 47.6 MB | #4 | 14%  2025-05-07T20:25:41.8278388Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:25:41.8278704Z 2025-05-07T20:25:41.8278710Z 2025-05-07T20:25:41.8278715Z 2025-05-07T20:25:41.8278720Z 2025-05-07T20:25:41.8278725Z 2025-05-07T20:25:41.8278730Z 2025-05-07T20:25:41.8278735Z 2025-05-07T20:25:41.8282430Z 2025-05-07T20:25:41.8394983Z cuda-nvdisasm-12.6.7 | 47.6 MB | ## | 21%  2025-05-07T20:25:41.9357584Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:41.9357945Z 2025-05-07T20:25:41.9357951Z 2025-05-07T20:25:41.9357956Z 2025-05-07T20:25:41.9357961Z 2025-05-07T20:25:41.9357966Z 2025-05-07T20:25:41.9357971Z 2025-05-07T20:25:41.9357976Z 2025-05-07T20:25:41.9361579Z 2025-05-07T20:25:41.9520477Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##7 | 27%  2025-05-07T20:25:42.0364145Z nsight-compute-2024. 
| 443.1 MB | #########8 | 98% 2025-05-07T20:25:42.0364446Z 2025-05-07T20:25:42.0364450Z 2025-05-07T20:25:42.0364454Z 2025-05-07T20:25:42.0364466Z 2025-05-07T20:25:42.0364470Z 2025-05-07T20:25:42.0364473Z 2025-05-07T20:25:42.0364477Z 2025-05-07T20:25:42.0364681Z 2025-05-07T20:25:42.0716712Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###4 | 34%  2025-05-07T20:25:42.1366486Z nsight-compute-2024. | 443.1 MB | #########9 | 99% 2025-05-07T20:25:42.1366760Z 2025-05-07T20:25:42.1366764Z 2025-05-07T20:25:42.1366768Z 2025-05-07T20:25:42.1366771Z 2025-05-07T20:25:42.1366775Z 2025-05-07T20:25:42.1366778Z 2025-05-07T20:25:42.1366782Z 2025-05-07T20:25:42.1369252Z 2025-05-07T20:25:42.1739352Z cuda-nvdisasm-12.6.7 | 47.6 MB | #### | 41%  2025-05-07T20:25:42.2378448Z nsight-compute-2024. | 443.1 MB | #########9 | 100% 2025-05-07T20:25:42.2378713Z 2025-05-07T20:25:42.2378717Z 2025-05-07T20:25:42.2378721Z 2025-05-07T20:25:42.2378970Z 2025-05-07T20:25:42.2378976Z 2025-05-07T20:25:42.2378981Z 2025-05-07T20:25:42.2378986Z 2025-05-07T20:25:42.2383246Z 2025-05-07T20:25:42.2778015Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####7 | 47%  2025-05-07T20:25:42.2778335Z 2025-05-07T20:25:42.2778339Z 2025-05-07T20:25:42.2778343Z 2025-05-07T20:25:42.2778346Z 2025-05-07T20:25:42.2778350Z 2025-05-07T20:25:42.3383161Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:42.3383468Z 2025-05-07T20:25:42.3383474Z 2025-05-07T20:25:42.3383479Z 2025-05-07T20:25:42.3383484Z 2025-05-07T20:25:42.3383489Z 2025-05-07T20:25:42.3383494Z 2025-05-07T20:25:42.3383510Z 2025-05-07T20:25:42.3385784Z 2025-05-07T20:25:42.3467610Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####4 | 55%  2025-05-07T20:25:42.3467996Z 2025-05-07T20:25:42.3468000Z 2025-05-07T20:25:42.3468014Z 2025-05-07T20:25:42.3468017Z 2025-05-07T20:25:42.3468021Z 2025-05-07T20:25:42.3468036Z 2025-05-07T20:25:42.3468040Z 2025-05-07T20:25:42.3468043Z 2025-05-07T20:25:42.3469470Z 2025-05-07T20:25:42.4468279Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:25:42.4468695Z 2025-05-07T20:25:42.4468702Z 2025-05-07T20:25:42.4468706Z 2025-05-07T20:25:42.4468710Z 2025-05-07T20:25:42.4468714Z 2025-05-07T20:25:42.4468717Z 2025-05-07T20:25:42.4468721Z 2025-05-07T20:25:42.4468725Z 2025-05-07T20:25:42.4470543Z 2025-05-07T20:25:42.4496904Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:25:42.4497198Z 2025-05-07T20:25:42.4497202Z 2025-05-07T20:25:42.4497206Z 2025-05-07T20:25:42.4497209Z 2025-05-07T20:25:42.4497213Z 2025-05-07T20:25:42.4497219Z 2025-05-07T20:25:42.4497224Z 2025-05-07T20:25:42.4497229Z 2025-05-07T20:25:42.5482223Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######1 | 61%  2025-05-07T20:25:42.5482617Z 2025-05-07T20:25:42.5482621Z 2025-05-07T20:25:42.5482624Z 2025-05-07T20:25:42.5482641Z 2025-05-07T20:25:42.5482645Z 2025-05-07T20:25:42.5482649Z 2025-05-07T20:25:42.5482652Z 2025-05-07T20:25:42.5482656Z 2025-05-07T20:25:42.5483984Z 2025-05-07T20:25:42.5579491Z libcurand-10.3.7.77 | 39.9 MB | #4 | 14%  2025-05-07T20:25:42.5579992Z 2025-05-07T20:25:42.5579996Z 2025-05-07T20:25:42.5580000Z 2025-05-07T20:25:42.5580003Z 2025-05-07T20:25:42.5580007Z 2025-05-07T20:25:42.5580011Z 2025-05-07T20:25:42.5580023Z 2025-05-07T20:25:42.5580027Z 2025-05-07T20:25:42.6483087Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######7 | 68%  2025-05-07T20:25:42.6483390Z 2025-05-07T20:25:42.6483394Z 2025-05-07T20:25:42.6483398Z 2025-05-07T20:25:42.6483409Z 2025-05-07T20:25:42.6483413Z 2025-05-07T20:25:42.6483417Z 2025-05-07T20:25:42.6483421Z 2025-05-07T20:25:42.6483424Z 2025-05-07T20:25:42.6485223Z 2025-05-07T20:25:42.6627024Z 
2025-05-07T20:25:44.2735527Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:44.3220728Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:25:44.6882102Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:25:44.6923483Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:25:44.8979178Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:25:46.2301262Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:25:46.2962158Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:25:46.3961402Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:25:46.4081265Z python-3.10.13 | 24.5 MB | ########## | 100%
2025-05-07T20:25:46.9168497Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:25:47.1175231Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:25:47.2212325Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:25:47.3277048Z ... (more hidden) ...
2025-05-07T20:25:47.5009018Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:25:47.6094595Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:25:48.3950107Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:25:49.3697209Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:25:49.7778330Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:25:50.6209146Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:58.3214243Z 2025-05-07T20:25:58.3214385Z  2025-05-07T20:25:58.3214528Z 2025-05-07T20:25:58.3214533Z 2025-05-07T20:25:58.3214538Z 2025-05-07T20:25:58.3214678Z  2025-05-07T20:25:58.3214833Z 2025-05-07T20:25:58.3214838Z 2025-05-07T20:25:58.3214844Z 2025-05-07T20:25:58.3214849Z 2025-05-07T20:25:58.3214996Z  2025-05-07T20:25:58.3215167Z 2025-05-07T20:25:58.3215173Z 2025-05-07T20:25:58.3215187Z 2025-05-07T20:25:58.3215192Z 2025-05-07T20:25:58.3215197Z 2025-05-07T20:25:58.3215355Z  2025-05-07T20:25:58.3215531Z 2025-05-07T20:25:58.3215542Z 2025-05-07T20:25:58.3215548Z 2025-05-07T20:25:58.3215553Z 2025-05-07T20:25:58.3215559Z 2025-05-07T20:25:58.3215564Z 2025-05-07T20:25:58.3215719Z  2025-05-07T20:25:58.3215904Z 2025-05-07T20:25:58.3215909Z 2025-05-07T20:25:58.3215914Z 2025-05-07T20:25:58.3215920Z 2025-05-07T20:25:58.3215924Z 2025-05-07T20:25:58.3215929Z 2025-05-07T20:25:58.3215934Z 2025-05-07T20:25:58.3216090Z  2025-05-07T20:25:58.3216289Z 2025-05-07T20:25:58.3216295Z 2025-05-07T20:25:58.3216300Z 2025-05-07T20:25:58.3216305Z 2025-05-07T20:25:58.3216310Z 2025-05-07T20:25:58.3216315Z 2025-05-07T20:25:58.3216320Z 2025-05-07T20:25:58.3216325Z 2025-05-07T20:25:58.3216484Z  2025-05-07T20:25:58.3216699Z 2025-05-07T20:25:58.3216704Z 2025-05-07T20:25:58.3216709Z 2025-05-07T20:25:58.3216721Z 2025-05-07T20:25:58.3216741Z 2025-05-07T20:25:58.3216746Z 2025-05-07T20:25:58.3216752Z 2025-05-07T20:25:58.3216757Z 2025-05-07T20:25:58.3216762Z 2025-05-07T20:25:58.3216936Z  2025-05-07T20:25:58.3217153Z 2025-05-07T20:25:58.3217159Z 2025-05-07T20:25:58.3217164Z 2025-05-07T20:25:58.3217169Z 2025-05-07T20:25:58.3217175Z 2025-05-07T20:25:58.3217180Z 2025-05-07T20:25:58.3217185Z 2025-05-07T20:25:58.3217190Z 2025-05-07T20:25:58.3217194Z 2025-05-07T20:25:58.3217199Z 2025-05-07T20:25:58.3217379Z  2025-05-07T20:25:58.3217604Z 2025-05-07T20:25:58.3217609Z 2025-05-07T20:25:58.3217614Z 2025-05-07T20:25:58.3217619Z 2025-05-07T20:25:58.3217624Z 2025-05-07T20:25:58.3217630Z 2025-05-07T20:25:58.3217635Z 2025-05-07T20:25:58.3217640Z 2025-05-07T20:25:58.3217645Z 2025-05-07T20:25:58.3217650Z 2025-05-07T20:25:58.3217655Z 2025-05-07T20:25:58.3217838Z  2025-05-07T20:25:58.3218074Z 2025-05-07T20:25:58.3218085Z 2025-05-07T20:25:58.3218090Z 2025-05-07T20:25:58.3218095Z 2025-05-07T20:25:58.3218100Z 2025-05-07T20:25:58.3218105Z 2025-05-07T20:25:58.3218110Z 2025-05-07T20:25:58.3218120Z 2025-05-07T20:25:58.3218125Z 2025-05-07T20:25:58.3218130Z 2025-05-07T20:25:58.3218143Z 2025-05-07T20:25:58.3218148Z 2025-05-07T20:25:58.3218329Z  2025-05-07T20:25:58.3218576Z 2025-05-07T20:25:58.3218580Z 2025-05-07T20:25:58.3218586Z 2025-05-07T20:25:58.3218591Z 2025-05-07T20:25:58.3218596Z 2025-05-07T20:25:58.3218601Z 2025-05-07T20:25:58.3218614Z 2025-05-07T20:25:58.3218619Z 2025-05-07T20:25:58.3218624Z 2025-05-07T20:25:58.3218629Z 2025-05-07T20:25:58.3218634Z 2025-05-07T20:25:58.3218639Z 2025-05-07T20:25:58.3218645Z 2025-05-07T20:25:58.3218845Z  2025-05-07T20:25:58.3219110Z 2025-05-07T20:25:58.3219116Z 2025-05-07T20:25:58.3219121Z 2025-05-07T20:25:58.3219126Z 2025-05-07T20:25:58.3219131Z 2025-05-07T20:25:58.3219136Z 2025-05-07T20:25:58.3219256Z 2025-05-07T20:25:58.3219261Z 2025-05-07T20:25:58.3219266Z 2025-05-07T20:25:58.3219271Z 2025-05-07T20:25:58.3219276Z 2025-05-07T20:25:58.3219281Z 2025-05-07T20:25:58.3219379Z 2025-05-07T20:25:58.3219385Z 2025-05-07T20:25:58.3219598Z  2025-05-07T20:25:58.3219990Z 2025-05-07T20:25:58.3219996Z 2025-05-07T20:25:58.3220001Z 2025-05-07T20:25:58.3220006Z 2025-05-07T20:25:58.3220011Z 2025-05-07T20:25:58.3220016Z 
2025-05-07T20:25:58.3220022Z 2025-05-07T20:25:58.3220026Z 2025-05-07T20:25:58.3220032Z 2025-05-07T20:25:58.3220037Z 2025-05-07T20:25:58.3220050Z 2025-05-07T20:25:58.3220056Z 2025-05-07T20:25:58.3220061Z 2025-05-07T20:25:58.3220066Z 2025-05-07T20:25:58.3220071Z 2025-05-07T20:25:58.3220282Z  2025-05-07T20:25:58.3220554Z 2025-05-07T20:25:58.3220559Z 2025-05-07T20:25:58.3220564Z 2025-05-07T20:25:58.3220576Z 2025-05-07T20:25:58.3220581Z 2025-05-07T20:25:58.3220586Z 2025-05-07T20:25:58.3220600Z 2025-05-07T20:25:58.3220605Z 2025-05-07T20:25:58.3220610Z 2025-05-07T20:25:58.3220615Z 2025-05-07T20:25:58.3220620Z 2025-05-07T20:25:58.3220625Z 2025-05-07T20:25:58.3220636Z 2025-05-07T20:25:58.3220641Z 2025-05-07T20:25:58.3220647Z 2025-05-07T20:25:58.3220652Z 2025-05-07T20:25:58.3220865Z  2025-05-07T20:25:58.3221153Z 2025-05-07T20:25:58.3221158Z 2025-05-07T20:25:58.3221164Z 2025-05-07T20:25:58.3221169Z 2025-05-07T20:25:58.3221174Z 2025-05-07T20:25:58.3221179Z 2025-05-07T20:25:58.3221184Z 2025-05-07T20:25:58.3221189Z 2025-05-07T20:25:58.3221193Z 2025-05-07T20:25:58.3221198Z 2025-05-07T20:25:58.3221203Z 2025-05-07T20:25:58.3221208Z 2025-05-07T20:25:58.3221213Z 2025-05-07T20:25:58.3221218Z 2025-05-07T20:25:58.3221223Z 2025-05-07T20:25:58.3221228Z 2025-05-07T20:25:58.3221233Z 2025-05-07T20:25:58.3221459Z  2025-05-07T20:25:58.3221746Z 2025-05-07T20:25:58.3221752Z 2025-05-07T20:25:58.3221764Z 2025-05-07T20:25:58.3221769Z 2025-05-07T20:25:58.3221774Z 2025-05-07T20:25:58.3221778Z 2025-05-07T20:25:58.3221783Z 2025-05-07T20:25:58.3221789Z 2025-05-07T20:25:58.3221808Z 2025-05-07T20:25:58.3221813Z 2025-05-07T20:25:58.3221818Z 2025-05-07T20:25:58.3221823Z 2025-05-07T20:25:58.3221828Z 2025-05-07T20:25:58.3221833Z 2025-05-07T20:25:58.3221838Z 2025-05-07T20:25:58.3221843Z 2025-05-07T20:25:58.3221848Z 2025-05-07T20:25:58.3221854Z 2025-05-07T20:25:58.3222083Z  2025-05-07T20:25:58.3222387Z 2025-05-07T20:25:58.3222392Z 2025-05-07T20:25:58.3222569Z  2025-05-07T20:25:58.3222715Z 2025-05-07T20:25:58.3222721Z 2025-05-07T20:25:58.3222862Z  2025-05-07T20:25:58.3223017Z 2025-05-07T20:25:58.3223022Z 2025-05-07T20:25:58.3223028Z 2025-05-07T20:25:58.3223171Z  2025-05-07T20:25:58.3223327Z 2025-05-07T20:25:58.3223332Z 2025-05-07T20:25:58.3223337Z 2025-05-07T20:25:58.3223342Z 2025-05-07T20:25:58.3223483Z  2025-05-07T20:25:58.3223646Z 2025-05-07T20:25:58.3223651Z 2025-05-07T20:25:58.3223656Z 2025-05-07T20:25:58.3223661Z 2025-05-07T20:25:58.3223673Z 2025-05-07T20:25:58.3223824Z  2025-05-07T20:25:58.3223989Z 2025-05-07T20:25:58.3223994Z 2025-05-07T20:25:58.3223999Z 2025-05-07T20:25:58.3224004Z 2025-05-07T20:25:58.3224009Z 2025-05-07T20:25:58.3224014Z 2025-05-07T20:25:58.3224171Z  2025-05-07T20:25:58.3224342Z 2025-05-07T20:25:58.3224347Z 2025-05-07T20:25:58.3224352Z 2025-05-07T20:25:58.3224357Z 2025-05-07T20:25:58.3224362Z 2025-05-07T20:25:58.3224368Z 2025-05-07T20:25:58.3224373Z 2025-05-07T20:25:58.3224539Z  2025-05-07T20:25:58.3224728Z 2025-05-07T20:25:58.3224734Z 2025-05-07T20:25:58.3224739Z 2025-05-07T20:25:58.3224744Z 2025-05-07T20:25:58.3224750Z 2025-05-07T20:25:58.3224755Z 2025-05-07T20:25:58.3224760Z 2025-05-07T20:25:58.3224765Z 2025-05-07T20:25:58.3224937Z  2025-05-07T20:25:58.3225138Z 2025-05-07T20:25:58.3225289Z 2025-05-07T20:25:58.3225294Z 2025-05-07T20:25:58.3225299Z 2025-05-07T20:25:58.3225304Z 2025-05-07T20:25:58.3225309Z 2025-05-07T20:25:58.3225394Z 2025-05-07T20:25:58.3225400Z 2025-05-07T20:25:58.3225405Z 2025-05-07T20:25:58.3225586Z  2025-05-07T20:25:58.3225803Z 2025-05-07T20:25:58.3225809Z 2025-05-07T20:25:58.3225814Z 
2025-05-07T20:25:58.3225819Z 2025-05-07T20:25:58.3225824Z 2025-05-07T20:25:58.3225829Z 2025-05-07T20:25:58.3225834Z 2025-05-07T20:25:58.3225839Z 2025-05-07T20:25:58.3225852Z 2025-05-07T20:25:58.3225857Z 2025-05-07T20:25:58.3226034Z  2025-05-07T20:25:58.3226258Z 2025-05-07T20:25:58.3226263Z 2025-05-07T20:25:58.3226268Z 2025-05-07T20:25:58.3226273Z 2025-05-07T20:25:58.3226278Z 2025-05-07T20:25:58.3226290Z 2025-05-07T20:25:58.3226295Z 2025-05-07T20:25:58.3226300Z 2025-05-07T20:25:58.3226305Z 2025-05-07T20:25:58.3226310Z 2025-05-07T20:25:58.3226315Z 2025-05-07T20:25:58.3226489Z  2025-05-07T20:25:58.3226731Z 2025-05-07T20:25:58.3226746Z 2025-05-07T20:25:58.3226751Z 2025-05-07T20:25:58.3226756Z 2025-05-07T20:25:58.3226761Z 2025-05-07T20:25:58.3226773Z 2025-05-07T20:25:58.3226778Z 2025-05-07T20:25:58.3226783Z 2025-05-07T20:25:58.3226788Z 2025-05-07T20:25:58.3226794Z 2025-05-07T20:25:58.3226799Z 2025-05-07T20:25:58.3226804Z 2025-05-07T20:25:58.3226984Z  2025-05-07T20:25:58.3227238Z 2025-05-07T20:25:58.3227243Z 2025-05-07T20:25:58.3227248Z 2025-05-07T20:25:58.3227253Z 2025-05-07T20:25:58.3227258Z 2025-05-07T20:25:58.3227263Z 2025-05-07T20:25:58.3227268Z 2025-05-07T20:25:58.3227274Z 2025-05-07T20:25:58.3227279Z 2025-05-07T20:25:58.3227284Z 2025-05-07T20:25:58.3227289Z 2025-05-07T20:25:58.3227294Z 2025-05-07T20:25:58.3227299Z 2025-05-07T20:25:58.3227487Z  2025-05-07T20:25:58.3227745Z 2025-05-07T20:25:58.3227750Z 2025-05-07T20:25:58.3227755Z 2025-05-07T20:25:58.3227760Z 2025-05-07T20:25:58.3227772Z 2025-05-07T20:25:58.3227777Z 2025-05-07T20:25:58.3227782Z 2025-05-07T20:25:58.3227788Z 2025-05-07T20:25:58.3227792Z 2025-05-07T20:25:58.3227801Z 2025-05-07T20:25:58.3227804Z 2025-05-07T20:25:58.3227808Z 2025-05-07T20:25:58.3227812Z 2025-05-07T20:25:58.3227815Z 2025-05-07T20:25:58.3227967Z  2025-05-07T20:25:58.3228154Z 2025-05-07T20:25:58.3228158Z 2025-05-07T20:25:58.3228161Z 2025-05-07T20:25:58.3228165Z 2025-05-07T20:25:58.3228169Z 2025-05-07T20:25:58.3228173Z 2025-05-07T20:25:58.3228176Z 2025-05-07T20:25:58.3228180Z 2025-05-07T20:25:58.3228190Z 2025-05-07T20:25:58.3228194Z 2025-05-07T20:25:58.3228198Z 2025-05-07T20:25:58.3228201Z 2025-05-07T20:25:58.3228205Z 2025-05-07T20:25:58.3228208Z 2025-05-07T20:25:58.3228212Z 2025-05-07T20:25:58.3228382Z  2025-05-07T20:25:58.3228574Z 2025-05-07T20:25:58.3228578Z 2025-05-07T20:25:58.3228581Z 2025-05-07T20:25:58.3228585Z 2025-05-07T20:25:58.3228592Z 2025-05-07T20:25:58.3228595Z 2025-05-07T20:25:58.3228599Z 2025-05-07T20:25:58.3228602Z 2025-05-07T20:25:58.3228606Z 2025-05-07T20:25:58.3228614Z 2025-05-07T20:25:58.3228617Z 2025-05-07T20:25:58.3228626Z 2025-05-07T20:25:58.3228630Z 2025-05-07T20:25:58.3228633Z 2025-05-07T20:25:58.3228637Z 2025-05-07T20:25:58.3228640Z 2025-05-07T20:25:58.3228789Z  2025-05-07T20:25:58.3228986Z 2025-05-07T20:25:58.3228990Z 2025-05-07T20:25:58.3228994Z 2025-05-07T20:25:58.3229003Z 2025-05-07T20:25:58.3229006Z 2025-05-07T20:25:58.3229010Z 2025-05-07T20:25:58.3229013Z 2025-05-07T20:25:58.3229017Z 2025-05-07T20:25:58.3229021Z 2025-05-07T20:25:58.3229024Z 2025-05-07T20:25:58.3229028Z 2025-05-07T20:25:58.3229031Z 2025-05-07T20:25:58.3229035Z 2025-05-07T20:25:58.3229038Z 2025-05-07T20:25:58.3229042Z 2025-05-07T20:25:58.3229045Z 2025-05-07T20:25:58.3229049Z 2025-05-07T20:25:58.3229201Z  2025-05-07T20:25:58.3229506Z 2025-05-07T20:25:58.3229510Z 2025-05-07T20:25:58.3229513Z 2025-05-07T20:25:58.3229517Z 2025-05-07T20:25:58.3229521Z 2025-05-07T20:25:58.3229645Z 2025-05-07T20:25:58.3229649Z 2025-05-07T20:25:58.3229653Z 2025-05-07T20:25:58.3229656Z 
2025-05-07T20:25:58.3233262Z done
2025-05-07T20:25:58.6444232Z Preparing transaction: done
2025-05-07T20:26:00.0915543Z Verifying transaction: done
2025-05-07T20:26:00.9317202Z Executing transaction: done
2025-05-07T20:26:03.2862313Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:03.2862857Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:03.2863803Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:03.2877763Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
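
A note on the symlink step above: the CUDA 12.x conda packages appear to ship only the versioned libnvToolsExt.so.1, while some build setups still link against the unversioned name, so the setup script recreates it in both library directories. A minimal standalone sketch of the same fix, assuming the environment prefix seen in this log (ENV_PREFIX is a hypothetical variable; adjust for other setups):

  # Recreate the unversioned libnvToolsExt.so symlink in every lib dir
  # that carries the versioned soname.
  ENV_PREFIX="/home/ec2-user/miniconda/envs/build_binary"
  for libdir in "${ENV_PREFIX}/lib" "${ENV_PREFIX}/targets/x86_64-linux/lib"; do
    if [ -f "${libdir}/libnvToolsExt.so.1" ]; then
      ln -sf "${libdir}/libnvToolsExt.so.1" "${libdir}/libnvToolsExt.so"
    fi
  done
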
2025-05-07T20:26:03.2891924Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:03.2896173Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:03.4486728Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:03.4513043Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:03.4892201Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:05.3802701Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:05.4449141Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:05.8711925Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:05.9063305Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
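
For context on the [ENV] steps above: `conda env config vars set` stores a variable with the environment itself, so it is exported on every subsequent `conda activate` or `conda run`. The ERROR line appears benign here: the script prints the current value before setting it, and printenv exits non-zero when the variable is still unset. A short sketch of the pattern, assuming the same build_binary environment:

  # Show the current value (may fail harmlessly if unset, as in this log),
  # persist the variable in the env, then verify it in a fresh `conda run`.
  conda run -n build_binary printenv LD_LIBRARY_PATH || true
  conda env config vars set -n build_binary \
      LD_LIBRARY_PATH="/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs"
  conda run -n build_binary printenv LD_LIBRARY_PATH
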
2025-05-07T20:26:06.3421697Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:06.3422694Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:08.8036812Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:10.8270397Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:12.8506326Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:12.8507129Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:14.8775480Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:16.7792339Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:16.8427169Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:20.6893867Z /tmp/tmpxr5mqe6j: line 3: clang: command not found
2025-05-07T20:26:20.6894744Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:20.7525293Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:20.7546373Z total 36
2025-05-07T20:26:20.7546754Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:26:20.7547290Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 ..
2025-05-07T20:26:20.7547854Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:20.7548358Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:20.7549215Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:20.7549698Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:20.7550141Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:20.7550595Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:20.7551097Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:20.7551732Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:20.7574370Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:22.7159719Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:22.7160277Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:23.1471457Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:25.0446515Z -allow-unsupported-compiler
2025-05-07T20:26:25.1086677Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:25.1087455Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:25.1087876Z 2025-05-07T20:26:27.0726803Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:27.0727571Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:27.0727997Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:27.0728319Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:27.0728639Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:27.0728896Z #define _STL_PAIR_H 1 2025-05-07T20:26:27.0729174Z #define __cpp_attributes 200809L 2025-05-07T20:26:27.0729499Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:27.0729840Z #define __DELETE_THROW throw() 2025-05-07T20:26:27.0730098Z #define _PTRDIFF_T_ 2025-05-07T20:26:27.0730336Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:27.0730700Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:27.0731096Z #define _IO_LEFT 02 2025-05-07T20:26:27.0731426Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:27.0731689Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:27.0732062Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:27.0732660Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:27.0733254Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:27.0733548Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:27.0733805Z #define _IOS_OUTPUT 2 2025-05-07T20:26:27.0734101Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:27.0734458Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:27.0735091Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:27.0735364Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:27.0735888Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:27.0736987Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:27.0738151Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:27.0738575Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:27.0739009Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:27.0739485Z #define _T_WCHAR_ 2025-05-07T20:26:27.0739796Z #define stdout stdout 2025-05-07T20:26:27.0740458Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:27.0740979Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:27.0741333Z #define __flexarr [] 2025-05-07T20:26:27.0741673Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:27.0742114Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:27.0742635Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:27.0743055Z #define _MATH_H 1 2025-05-07T20:26:27.0743441Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:27.0743919Z #define __S64_TYPE long int 2025-05-07T20:26:27.0744265Z #define __stub_fchflags 2025-05-07T20:26:27.0744631Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:27.0745028Z #define __SQUAD_TYPE long int 2025-05-07T20:26:27.0745395Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:27.0745763Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:27.0746110Z #define NL_NMAX INT_MAX 2025-05-07T20:26:27.0746456Z #define _BITS_TIME_H 1 2025-05-07T20:26:27.0746843Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:27.0747278Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:27.0747709Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:27.0748200Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:27.0748749Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:27.0749250Z #define __CHAR_BIT__ 8 2025-05-07T20:26:27.0749606Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.0750038Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:27.0750433Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:27.0750798Z #define FP_NAN 0 2025-05-07T20:26:27.0751197Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:27.0751802Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:27.0752483Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:27.0752985Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:27.0753277Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:27.0753548Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:27.0753805Z #define __SM_80_RT_H__ 2025-05-07T20:26:27.0754026Z #define _NEW 2025-05-07T20:26:27.0754256Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:27.0754550Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:27.0754961Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:27.0755413Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:27.0755658Z #define __USE_ANSI 1 2025-05-07T20:26:27.0755948Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:27.0756460Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:27.0756825Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:27.0757137Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:27.0757517Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:27.0757896Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:27.0758284Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:27.0758634Z #define PIPE_BUF 4096 2025-05-07T20:26:27.0758962Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:27.0759321Z #define ADJ_TICK 0x4000 2025-05-07T20:26:27.0759606Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:27.0760081Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:27.0760342Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:27.0760746Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:27.0761198Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:27.0761725Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:27.0762091Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:27.0762340Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:27.0762618Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:27.0762904Z #define __cpp_static_assert 201411L 2025-05-07T20:26:27.0763241Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:27.0763583Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:27.0763864Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:27.0764148Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:27.0764445Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:27.0764733Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:27.0765033Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.0765389Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:27.0765736Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:27.0766023Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:27.0766330Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.0766691Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:27.0767046Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:27.0767334Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:27.0767631Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:27.0767963Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:27.0768287Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:27.0768680Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:27.0769095Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:27.0769402Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:27.0769663Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:27.0769952Z #define __GCC_IEC_559 2 2025-05-07T20:26:27.0770248Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:27.0770581Z #define _IO_flockfile(_fp) 2025-05-07T20:26:27.0770848Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:27.0771120Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:27.0771379Z #define _IOFBF 0 2025-05-07T20:26:27.0771601Z #define __USE_BSD 1 2025-05-07T20:26:27.0771832Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:27.0772104Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:27.0772374Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:27.0772632Z #define _IO_NO_WRITES 8 2025-05-07T20:26:27.0772887Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:27.0773235Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:27.0773583Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:27.0773897Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:27.0774207Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:27.0774505Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:27.0774778Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:27.0775042Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:27.0775355Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:27.0775735Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:27.0776097Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:27.0776400Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:27.0776708Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:27.0777039Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:27.0777336Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:27.0777636Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:27.0777913Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:27.0778176Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:27.0778938Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:27.0779783Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:27.0780298Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:27.0780614Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:27.0780912Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:27.0781216Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:27.0781506Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:27.0781818Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:27.0782148Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:27.0782441Z #define RAND_MAX 2147483647 2025-05-07T20:26:27.0782913Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:27.0794572Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.0794988Z #define __SM_90_RT_H__ 2025-05-07T20:26:27.0795246Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:27.0795538Z #define __COMPAR_FN_T 2025-05-07T20:26:27.0795779Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.0796059Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:27.0796538Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:27.0797055Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:27.0797406Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:27.0797774Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:27.0798073Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:27.0798416Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:27.0798741Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:27.0799258Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:27.0799799Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:27.0800138Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:27.0800425Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:27.0800722Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:27.0801034Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:27.0801308Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:27.0801578Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:27.0801850Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:27.0802109Z #define __u_char_defined 2025-05-07T20:26:27.0802427Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:27.0802791Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:27.0803058Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:27.0803310Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:27.0803599Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:27.0804230Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:27.0804660Z #define FP_INFINITE 1 2025-05-07T20:26:27.0805027Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:27.0805454Z #define _IO_pid_t __pid_t 2025-05-07T20:26:27.0805717Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:27.0805979Z #define __LEAF , __leaf__ 2025-05-07T20:26:27.0806389Z #define PATH_MAX 4096 2025-05-07T20:26:27.0806649Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:27.0806993Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:27.0807313Z #define _LIMITS_H___ 2025-05-07T20:26:27.0807538Z #define __size_t 2025-05-07T20:26:27.0807771Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:27.0808315Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:27.0808967Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:27.0809280Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:27.0809616Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:27.0809884Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:27.0810241Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:27.0811034Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:27.0811336Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:27.0811842Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:27.0812137Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:27.0812425Z #define __INT8_C(c) c 2025-05-07T20:26:27.0812683Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:27.0812988Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:27.0813255Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:27.0813510Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:27.0813765Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:27.0814045Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:27.0814370Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.0814693Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:27.0814970Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:27.0815246Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:27.0815509Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:27.0815836Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:27.0816141Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:27.0816507Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:27.0816890Z #define NFDBITS __NFDBITS 2025-05-07T20:26:27.0817154Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:27.0817442Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:27.0817766Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:27.0818085Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:27.0818344Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:27.0818638Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:27.0818944Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:27.0819265Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:27.0819679Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:27.0820174Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:27.0820468Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:27.0820789Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:27.0821166Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:27.0821504Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:27.0821819Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:27.0822149Z #define __daddr_t_defined 2025-05-07T20:26:27.0822406Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:27.0822678Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:27.0822997Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:27.0823506Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:27.0823985Z #define _ACRTIMP 2025-05-07T20:26:27.0824204Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:27.0824472Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:27.0824859Z #define _IOS_BIN 128 2025-05-07T20:26:27.0825339Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:27.0825889Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:27.0826162Z #define UNDERFLOW 4 2025-05-07T20:26:27.0826387Z #define NAME_MAX 255 2025-05-07T20:26:27.0826628Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:27.0826901Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:27.0827175Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:27.0827469Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:27.0827848Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:27.0828236Z #define __ptr_t void * 2025-05-07T20:26:27.0828471Z #define M_E 2.7182818284590452354 2025-05-07T20:26:27.0828756Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:27.0829027Z #define __USE_ISOCXX11 1 2025-05-07T20:26:27.0829294Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:27.0829619Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:27.0829922Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:27.0830191Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:27.0830627Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:27.0830946Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:27.0831287Z #define __linux 1 2025-05-07T20:26:27.0831522Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:27.0831800Z #define cudaDeviceMask 0xff 2025-05-07T20:26:27.0832074Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:27.0832364Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:27.0832651Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:27.0832943Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:27.0833243Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:27.0833550Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:27.0833849Z #define _BITS_TYPES_H 1 2025-05-07T20:26:27.0834135Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:27.0834478Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:27.0834786Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:27.0835061Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:27.0835357Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:27.0835649Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:27.0836436Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:27.0837233Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:27.0837516Z #define __unix 1 2025-05-07T20:26:27.0837742Z #define MATH_ERRNO 1 2025-05-07T20:26:27.0837983Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:27.0838266Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:27.0838540Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:27.0838820Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:27.0839113Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.0839401Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:27.0840024Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:27.0840499Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:27.0840803Z #define CUDARTAPI_CDECL 2025-05-07T20:26:27.0841070Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:27.0841339Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:27.0841629Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:27.0841897Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:27.0842133Z #define __SIZE_T 2025-05-07T20:26:27.0842390Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:27.0842707Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:27.0843000Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:27.0843267Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:27.0843536Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:27.0843920Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:27.0844352Z #define __WAIT_STATUS void * 2025-05-07T20:26:27.0844620Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:27.0844893Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:27.0845162Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:27.0845449Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:27.0845732Z #define __WINT_MIN__ 0U 2025-05-07T20:26:27.0846305Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:27.0846946Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:27.0847247Z #define WUNTRACED 2 2025-05-07T20:26:27.0847476Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:27.0847758Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:27.0848045Z #define NZERO 20 2025-05-07T20:26:27.0848275Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:27.0848557Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:27.0848856Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:27.0849150Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:27.0849403Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:27.0849690Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:27.0850074Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:27.0850351Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:27.0850714Z #define EXIT_FAILURE 1 2025-05-07T20:26:27.0850962Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:27.0851223Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:27.0851498Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:27.0851757Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:27.0852040Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:27.0852383Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:27.0852760Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:27.0853062Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:27.0853321Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:27.0853599Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:27.0853892Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:27.0854201Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:27.0854494Z #define SEEK_DATA 3 2025-05-07T20:26:27.0854730Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:27.0855031Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:27.0855460Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:27.0855848Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:27.0856104Z #define __INT64_C(c) c ## L 2025-05-07T20:26:27.0856379Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:27.0856709Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:27.0857100Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:27.0857464Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:27.0857774Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:27.0858071Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:27.0858332Z #define __INT_WCHAR_T_H 2025-05-07T20:26:27.0858576Z #define WSTOPPED 2 2025-05-07T20:26:27.0858839Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:27.0859195Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:27.0859455Z #define FP_NORMAL 4 
2025-05-07T20:26:27.0859714Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:27.0860103Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:27.0860441Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:27.0860705Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:27.0860994Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:27.0861310Z #define cudaTextureType1D 0x01 2025-05-07T20:26:27.0861635Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:27.0861914Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:27.0862197Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:27.0862494Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:27.0862933Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:27.0863393Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:27.0863669Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:27.0863932Z #define _POSIX_SOURCE 1 2025-05-07T20:26:27.0864191Z #define cudaTextureType2D 0x02 2025-05-07T20:26:27.0864464Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:27.0864747Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:27.0865071Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:27.0865352Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:27.0865678Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:27.0866036Z #define cudaTextureType3D 0x03 2025-05-07T20:26:27.0866323Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:27.0866589Z #define CLOCK_REALTIME 0 2025-05-07T20:26:27.0866850Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:27.0867142Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:27.0867459Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:27.0867750Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:27.0868048Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:27.0868524Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:27.0868807Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:27.0869124Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:27.0869436Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:27.0869876Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:27.0870164Z #define __GLIBC__ 2 2025-05-07T20:26:27.0870486Z #define __END_DECLS } 2025-05-07T20:26:27.0870754Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:27.0871176Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:27.0871625Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:27.0871904Z #define WCONTINUED 8 2025-05-07T20:26:27.0872165Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:27.0872458Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:27.0872772Z #define _ALLOCA_H 1 2025-05-07T20:26:27.0873028Z #define __host__ __location__(host) 2025-05-07T20:26:27.0873525Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:27.0874051Z #define __SLONG32_TYPE int 2025-05-07T20:26:27.0874350Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:27.0874678Z #define _SYS_SELECT_H 1 2025-05-07T20:26:27.0874955Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:27.0875239Z #define _IOS_NOCREATE 32 2025-05-07T20:26:27.0875525Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:27.0875849Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:27.0876179Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:27.0876511Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:27.0876848Z #define __global__ __location__(global) 2025-05-07T20:26:27.0877176Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:27.0877469Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:26:27.0877771Z #define __DBL_DIG__ 15 2025-05-07T20:26:27.0878022Z #define TIME_UTC 1 2025-05-07T20:26:27.0878253Z #define __FLT32_DIG__ 6 2025-05-07T20:26:27.0878597Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:27.0879010Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:27.0879338Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:27.0879671Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:27.0879986Z #define _G_BUFSIZ 8192 2025-05-07T20:26:27.0880305Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:27.0880693Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:27.0881009Z #define __cudaCDP2GetDevice 2025-05-07T20:26:27.0881303Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:27.0881607Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:27.0881868Z #define __GXX_WEAK__ 1 2025-05-07T20:26:27.0882131Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.0882447Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:27.0882725Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:27.0883034Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:27.0883381Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:27.0883674Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:27.0883982Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:27.0884290Z #define _G_config_h 1 2025-05-07T20:26:27.0884586Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:27.0884935Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:27.0885229Z #define _GCC_WCHAR_T 2025-05-07T20:26:27.0885478Z #define TMP_MAX 238328 2025-05-07T20:26:27.0885739Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:27.0886016Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:27.0886292Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.0886588Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:27.0886875Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:27.0887177Z #define _IO_SKIPWS 01 2025-05-07T20:26:27.0887596Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:27.0888074Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:27.0888353Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:27.0888703Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:27.0889080Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:27.0889453Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:27.0890104Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:27.0890691Z #define le32toh(x) (x) 2025-05-07T20:26:27.0890932Z #define _SIZE_T_DEFINED 2025-05-07T20:26:27.0891328Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:27.0891685Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:27.0892041Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:27.0892450Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:27.0892874Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:27.0893156Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:27.0893429Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:27.0893712Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:27.0894006Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:27.0894541Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:27.0895057Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:27.0895383Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:27.0895747Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:27.0896078Z #define _WCHAR_T_ 2025-05-07T20:26:27.0896329Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:27.0896708Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:27.0897100Z #define RTSIG_MAX 32 2025-05-07T20:26:27.0897340Z #define _STDDEF_H 2025-05-07T20:26:27.0897586Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:27.0897867Z #define _VA_LIST_DEFINED 2025-05-07T20:26:27.0898134Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:27.0898482Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:27.0898878Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:27.0899221Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:27.0899536Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:27.0900141Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:27.0900693Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:27.0901080Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:27.0901420Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:27.0901738Z #define __unix__ 1 2025-05-07T20:26:27.0901989Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.0902286Z #define __INT_WIDTH__ 32 2025-05-07T20:26:27.0902541Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:27.0902791Z #define _IONBF 2 2025-05-07T20:26:27.0903247Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:27.0904016Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:27.0904566Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:27.0904841Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:27.0905126Z #define __UINT16_C(c) c 2025-05-07T20:26:27.0905375Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:27.0905667Z #define STA_DEL 0x0020 2025-05-07T20:26:27.0905925Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:27.0906188Z #define __id_t_defined 2025-05-07T20:26:27.0906476Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:27.0906935Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:27.0907373Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:27.0907651Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:27.0907925Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:27.0908185Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:27.0908468Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:27.0908748Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:27.0909020Z #define SING 2 2025-05-07T20:26:27.0909252Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:27.0909537Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:27.0909850Z #define cudaStreamDefault 0x00 2025-05-07T20:26:27.0910203Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:27.0910715Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:27.0911026Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:27.0911401Z #define __gnu_linux__ 1 2025-05-07T20:26:27.0911652Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:27.0911921Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:27.0912177Z #define MAX_INPUT 255 2025-05-07T20:26:27.0912434Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:27.0912775Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:27.0913153Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:27.0913483Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:27.0913820Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:27.0914238Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:27.0914665Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:27.0915028Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:27.0915400Z #define _Mfloat_ float 2025-05-07T20:26:27.0915678Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:27.0916004Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:27.0916313Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:27.0916814Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:27.0917321Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.0917614Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:27.0917956Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:27.0918322Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:27.0918631Z #define __USE_ISOC11 1 2025-05-07T20:26:27.0918877Z #define _BSD_SIZE_T_ 2025-05-07T20:26:27.0919117Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:27.0928457Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:27.0928795Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:27.0929115Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:27.0929450Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:27.0929773Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:27.0930113Z #define __THROW throw () 2025-05-07T20:26:27.0930387Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:27.0930681Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:27.0931044Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:27.0931404Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:27.0931691Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:27.0931959Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:27.0932233Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:27.0932503Z #define L_tmpnam 20 2025-05-07T20:26:27.0932736Z #define ___int_wchar_t_h 2025-05-07T20:26:27.0933090Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:27.0933484Z #define isascii(c) __isascii (c) 2025-05-07T20:26:27.0933748Z #define _T_PTRDIFF 2025-05-07T20:26:27.0934078Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:27.0934455Z #define toascii(c) __toascii (c) 2025-05-07T20:26:27.0934715Z #define __GNUC__ 11 2025-05-07T20:26:27.0934986Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:27.0935295Z #define __GXX_RTTI 1 2025-05-07T20:26:27.0935526Z #define __pie__ 2 2025-05-07T20:26:27.0935751Z #define __MMX__ 1 2025-05-07T20:26:27.0935986Z #define __cudaCDP2Malloc 2025-05-07T20:26:27.0936247Z #define __timespec_defined 1 2025-05-07T20:26:27.0936499Z #define L_ctermid 9 2025-05-07T20:26:27.0936740Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:27.0937056Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:27.0937445Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:27.0937824Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:27.0938100Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:27.0938395Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:27.0938706Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:27.0939024Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:27.0939614Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:27.0940379Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:27.0941147Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:27.0941755Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:27.0942065Z #define __USE_SVID 1 2025-05-07T20:26:27.0942324Z #define __constant__ __location__(constant) 2025-05-07T20:26:27.0942642Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:27.0942946Z #define __device__ __location__(device) 2025-05-07T20:26:27.0943269Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:27.0943598Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:27.0943869Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:27.0944147Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:27.0944504Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:27.0944882Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:27.0945164Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:27.0945545Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:27.0945931Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:27.0946195Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:27.0946555Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:27.0946979Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:27.0947299Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:27.0947568Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:27.0947838Z #define NGROUPS_MAX 65536 2025-05-07T20:26:27.0948097Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:27.0948359Z #define __USE_ISOC95 1 2025-05-07T20:26:27.0948591Z #define _TIME_H 1 2025-05-07T20:26:27.0948865Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:27.0949179Z #define __USE_ISOC99 1 2025-05-07T20:26:27.0949517Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:27.0949887Z #define HOST_NAME_MAX 64 2025-05-07T20:26:27.0950144Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:27.0950410Z #define _IOS_ATEND 4 2025-05-07T20:26:27.0950650Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:27.0950979Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:27.0951377Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:27.0951724Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:27.0952054Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:27.0952474Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:27.0952841Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:27.0953113Z #define _STDIO_H 1 2025-05-07T20:26:27.0953509Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:27.0953980Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:27.0954346Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:27.0954731Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:27.0955027Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:27.0955300Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:27.0955578Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:27.0955872Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:27.0956184Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.0956506Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:27.0956780Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:27.0957066Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:27.0957378Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:27.0957651Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:27.0957947Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:27.0958308Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:27.0958674Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:27.0959075Z #define __USE_XOPEN 1 2025-05-07T20:26:27.0959321Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:27.0959972Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:27.0960415Z #define __USE_XOPEN2K 1 2025-05-07T20:26:27.0960662Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:27.0960933Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:27.0961229Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:27.0961507Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:27.0962102Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:27.0962622Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:27.0962910Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:27.0963401Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:27.0963926Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:27.0964413Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:27.0964909Z #define __END_NAMESPACE_C99 2025-05-07T20:26:27.0965197Z #define __glibcxx_integral_traps true 2025-05-07T20:26:27.0965483Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:27.0965748Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:27.0966010Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:27.0966274Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:27.0966533Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:27.0966827Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:27.0967127Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:27.0967494Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:27.0967882Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:27.0968154Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:27.0968429Z #define _IO_UNITBUF 020000 2025-05-07T20:26:27.0968688Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:27.0968951Z #define __FD_SETSIZE 1024 2025-05-07T20:26:27.0969201Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:27.0969480Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:27.0969827Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:27.0970186Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:27.0970456Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:27.0970772Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:27.0971092Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:27.0971420Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:27.0971728Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:27.0972053Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:27.0972344Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:27.0972671Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:27.0972962Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:27.0973234Z #define __USE_POSIX199506 1 2025-05-07T20:26:27.0973484Z #define _FEATURES_H 1 2025-05-07T20:26:27.0973727Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:27.0974117Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:27.0974535Z #define __stub_getmsg 2025-05-07T20:26:27.0974774Z #define _IO_FIXED 010000 2025-05-07T20:26:27.0975050Z #define __cpp_lib_addressof_constexpr 201603 
2025-05-07T20:26:27.0975364Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:27.0975640Z #define __stub_setlogin 2025-05-07T20:26:27.0975882Z #define __stub_fattach 2025-05-07T20:26:27.0976127Z #define __cplusplus 201703L 2025-05-07T20:26:27.0976394Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:27.0976678Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:27.0976944Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:27.0977221Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:27.0977707Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:27.0978223Z #define _IO_INTERNAL 010 2025-05-07T20:26:27.0978480Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:27.0978820Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:27.0979169Z #define __dev_t_defined 2025-05-07T20:26:27.0979597Z #define __DEPRECATED 1 2025-05-07T20:26:27.0979972Z #define __S32_TYPE int 2025-05-07T20:26:27.0980371Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:27.0980682Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:27.0980948Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:27.0981202Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:27.0981809Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:27.0982441Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:27.0982761Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:27.0983106Z #define OVERFLOW 3 2025-05-07T20:26:27.0983358Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:27.0983673Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:27.0983959Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.0984301Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:27.0984631Z #define __SSE2_MATH__ 1 2025-05-07T20:26:27.0984883Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:27.0985197Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.0985510Z #define _IO_STDIO_H 2025-05-07T20:26:27.0985755Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:27.0986051Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:27.0986396Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:27.0986693Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:27.0987008Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:27.0987274Z #define __amd64 1 2025-05-07T20:26:27.0987498Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:27.0987773Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:27.0988057Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:27.0988344Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:27.0988654Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:27.0988931Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:27.0989226Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:27.0989499Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:27.0989753Z #define __bounded 2025-05-07T20:26:27.0990310Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:27.0990606Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:27.0990888Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:27.0991160Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:27.0991432Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.0991754Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:27.0992173Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:27.0992571Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:27.0992847Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:27.0993197Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:27.0993533Z #define STA_PLL 0x0001 2025-05-07T20:26:27.0993780Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:27.0994051Z #define __GNUG__ 11 2025-05-07T20:26:27.0994280Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:27.0994552Z #define _T_WCHAR 2025-05-07T20:26:27.0994794Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:27.0995084Z #define __specialization_static 2025-05-07T20:26:27.0995396Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:27.0995712Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:27.0995973Z #define cudaArraySparse 0x40 2025-05-07T20:26:27.0996233Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:27.0996488Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:27.0996776Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:27.0997072Z #define _WCHAR_T 2025-05-07T20:26:27.0997297Z #define __cudaCDP2Free 2025-05-07T20:26:27.0997932Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:27.0998868Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:27.0999291Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:27.1000029Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:27.1000435Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:27.1000700Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:27.1001036Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:27.1001388Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:27.1001627Z #define __NO_CTYPE 1 2025-05-07T20:26:27.1001860Z #define __stub_bdflush 2025-05-07T20:26:27.1002237Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:27.1002653Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:27.1002958Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:27.1003231Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:27.1003504Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:27.1003816Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:27.1004117Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:27.1004467Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:27.1004809Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:27.1005098Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:27.1005385Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:27.1005722Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:27.1006065Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:27.1006349Z #define _IO_STDIO 040000 2025-05-07T20:26:27.1006674Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:27.1007060Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:27.1007378Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:27.1007664Z #define _PTRDIFF_T 2025-05-07T20:26:27.1007885Z #define _MOVE_H 1 2025-05-07T20:26:27.1008111Z #define __cpp_hex_float 201603L 2025-05-07T20:26:27.1008373Z #define ADJ_TAI 0x0080 2025-05-07T20:26:27.1008595Z #define __ptrvalue 2025-05-07T20:26:27.1008826Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:27.1009084Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:27.1009364Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:27.1009673Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:27.1009931Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:27.1010212Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:27.1010606Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:27.1010985Z #define __USE_GNU 1 2025-05-07T20:26:27.1011214Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:27.1011493Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:27.1011770Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:27.1012150Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:27.1012535Z #define WEXITED 4 2025-05-07T20:26:27.1012757Z #define _IO_NO_READS 4 2025-05-07T20:26:27.1013060Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:27.1013402Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:27.1013688Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:27.1013990Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:27.1014306Z #define __uid_t_defined 2025-05-07T20:26:27.1014558Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:27.1014849Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:27.1015121Z #define WNOHANG 1 2025-05-07T20:26:27.1015371Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:27.1015677Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:27.1015951Z #define cudaEventDefault 0x00 2025-05-07T20:26:27.1016255Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:27.1016584Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:27.1016819Z #define __x86_64 1 2025-05-07T20:26:27.1017056Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:27.1017456Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:27.1017940Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:27.1018431Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:27.1018987Z #define __PTRDIFF_T 2025-05-07T20:26:27.1019401Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:27.1019781Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:27.1020194Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.1020492Z #define _Mlong_double_ long double 2025-05-07T20:26:27.1020770Z #define __cpp_lambdas 200907L 2025-05-07T20:26:27.1021027Z #define _IO_DEC 020 2025-05-07T20:26:27.1021263Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:27.1021537Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:27.1021820Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:27.1022106Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:27.1022374Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:27.1022673Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:27.1022997Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:27.1023273Z #define _ANSI_STDDEF_H 2025-05-07T20:26:27.1023551Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:27.1023868Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:27.1024240Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:27.1024619Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:27.1024904Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:27.1025200Z #define __cpp_template_auto 201606L 2025-05-07T20:26:27.1025554Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:27.1025923Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:27.1026198Z #define 
__key_t_defined 2025-05-07T20:26:27.1026452Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:27.1026815Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:27.1027279Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:27.1027644Z #define __GNUC_VA_LIST 2025-05-07T20:26:27.1027977Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:27.1028369Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:27.1028640Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:27.1028916Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:27.1029211Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:27.1029466Z #define __WCOREFLAG 0x80 2025-05-07T20:26:27.1029724Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:27.1030026Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:27.1030306Z #define __LP64__ 1 2025-05-07T20:26:27.1030560Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:27.1030874Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:27.1031161Z #define _IO_off64_t __off64_t 2025-05-07T20:26:27.1031431Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:27.1031692Z #define __time_t_defined 1 2025-05-07T20:26:27.1031949Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:27.1032298Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:27.1032664Z #define __USE_UNIX98 1 2025-05-07T20:26:27.1032912Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:27.1033199Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:27.1033467Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:27.1033770Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:27.1034085Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:27.1034349Z #define SEEK_CUR 1 2025-05-07T20:26:27.1034577Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:27.1034852Z #define _ASSERT_H 1 2025-05-07T20:26:27.1035420Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:27.1036044Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:27.1036324Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:27.1036580Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:27.1036843Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:27.1037125Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:27.1037615Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:27.1038024Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:27.1038796Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:27.1039454Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:27.1039748Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:27.1040093Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:27.1040474Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:27.1040747Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:27.1041026Z #define cudaArrayDefault 0x00 2025-05-07T20:26:27.1041310Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:27.1041606Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:27.1041884Z #define TLOSS 5 2025-05-07T20:26:27.1042107Z #define __ssize_t_defined 2025-05-07T20:26:27.1042371Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:27.1042648Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:27.1042944Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:27.1043245Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:27.1043607Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:27.1043986Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:27.1044272Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:27.1044570Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:27.1044882Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:27.1045180Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:27.1045470Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:27.1045731Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:27.1046068Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:27.1046428Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:27.1046674Z #define __cdecl 2025-05-07T20:26:27.1046934Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:27.1047280Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:27.1047614Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:27.1047867Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:27.1048144Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:27.1048439Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:27.1048705Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:27.1049018Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:27.1049356Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:27.1049758Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:27.1050192Z #define ADJ_NANO 0x2000 2025-05-07T20:26:27.1050503Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:27.1050861Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:27.1051152Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:27.1051418Z #define __FLT_DIG__ 6 2025-05-07T20:26:27.1060953Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:27.1061420Z #define __NO_INLINE__ 1 2025-05-07T20:26:27.1061756Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:27.1062126Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:27.1062432Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:27.1062807Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:27.1063139Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:27.1063415Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:27.1063727Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:27.1064026Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:27.1064416Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
2025-05-07T20:26:27.1064832Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:27.1065266Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:27.1065624Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:27.1065870Z #define MAX_CANON 255 2025-05-07T20:26:27.1066403Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:27.1066692Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:27.1067085Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:27.1067441Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:27.1067754Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:27.1068054Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:27.1068336Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:27.1068681Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:27.1069075Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:27.1069341Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:27.1069641Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:27.1069940Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:27.1070222Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:27.1070540Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:27.1070843Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:27.1071104Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:27.1071378Z #define _SYS_TYPES_H 1 2025-05-07T20:26:27.1071627Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:27.1071896Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:27.1072148Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:27.1072387Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:27.1072661Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:27.1072963Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:27.1073220Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:27.1073510Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:27.1073788Z #define FP_SUBNORMAL 3 2025-05-07T20:26:27.1074048Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:27.1074326Z #define _INITIALIZER_LIST 2025-05-07T20:26:27.1074582Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:27.1074835Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:27.1075109Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:27.1075400Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:27.1075662Z #define _IO_file_flags _flags 2025-05-07T20:26:27.1075930Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:27.1076176Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:27.1076461Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:27.1076740Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:27.1077006Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:27.1077391Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:27.1077785Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:27.1078089Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:27.1078363Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:27.1078621Z #define _BSD_SOURCE 1 2025-05-07T20:26:27.1078857Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:27.1079725Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:27.1080570Z #define __catch(X) catch(X) 2025-05-07T20:26:27.1080844Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:27.1081134Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:27.1081418Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:27.1081675Z #define __STRING(x) #x 2025-05-07T20:26:27.1081917Z #define 
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:27.1082197Z #define _T_PTRDIFF_ 2025-05-07T20:26:27.1082449Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:27.1082751Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:27.1083032Z #define __unbounded 2025-05-07T20:26:27.1083281Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:27.1083570Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:27.1083866Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.1084170Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:27.1084450Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:27.1084745Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:27.1085074Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:27.1085486Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:27.1085764Z #define __managed__ __location__(managed) 2025-05-07T20:26:27.1086140Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:27.1086550Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:27.1086967Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:27.1087230Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:27.1087606Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:27.1088009Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:27.1088258Z #define _SYS_SIZE_T_H 2025-05-07T20:26:27.1088554Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:27.1088896Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:27.1089174Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:27.1089470Z #define _CRTIMP 2025-05-07T20:26:27.1089698Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:27.1090360Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:27.1090779Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:27.1091147Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:27.1091554Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:27.1091874Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:27.1092159Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:27.1092444Z #define __SIZE_T__ 2025-05-07T20:26:27.1092665Z #define __stub_gtty 2025-05-07T20:26:27.1092897Z #define __pid_t_defined 2025-05-07T20:26:27.1093164Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:27.1093470Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:27.1095240Z #define __glibcxx_function_requires(...) 
2025-05-07T20:26:27.1095541Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:27.1095782Z #define __need_clockid_t 2025-05-07T20:26:27.1096032Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:27.1096295Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:27.1096617Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:27.1096940Z #define _IO_HEX 0100 2025-05-07T20:26:27.1097208Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:27.1097549Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:27.1097862Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:27.1098145Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:27.1098552Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:27.1098996Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:27.1099316Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:27.1099613Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:27.1099723Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:27.1099958Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:27.1100062Z #define __stub_sstk 2025-05-07T20:26:27.1100158Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:27.1100316Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:27.1100408Z #define __wur 2025-05-07T20:26:27.1100533Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:27.1100624Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:27.1100719Z #define _IO_OCT 040 2025-05-07T20:26:27.1100817Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:27.1100915Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:27.1101009Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:27.1101142Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:27.1101241Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:27.1101347Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:27.1101536Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:27.1101639Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:27.1101729Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:27.1101838Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:27.1101937Z #define __off64_t_defined 2025-05-07T20:26:27.1102039Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:27.1102130Z #define __FLT128_DIG__ 33 2025-05-07T20:26:27.1102478Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:27.1102581Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:27.1102676Z #define __INT32_C(c) c 2025-05-07T20:26:27.1102892Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:27.1102995Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:27.1103100Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:27.1103197Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:27.1103287Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:27.1103394Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:27.1103526Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:27.1103622Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:27.1103718Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:27.1103818Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:27.1103919Z #define __have_pthread_attr_t 1 2025-05-07T20:26:27.1104029Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:27.1104249Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:27.1104373Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:27.1104478Z #define __cudaCDP2EventRecord 2025-05-07T20:26:27.1104579Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:27.1104673Z #define 
htole32(x) (x) 2025-05-07T20:26:27.1104928Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:27.1105051Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:27.1105158Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:27.1105315Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:27.1105458Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:27.1105592Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:27.1105732Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:27.1105830Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:27.1105934Z #define cudaArrayLayered 0x01 2025-05-07T20:26:27.1106108Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:27.1106231Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:27.1106328Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:27.1106437Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:27.1106526Z #define unix 1 2025-05-07T20:26:27.1106623Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:27.1106718Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:27.1106823Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:27.1106946Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:27.1107040Z #define __USE_POSIX 1 2025-05-07T20:26:27.1107135Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:27.1107270Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:27.1107369Z #define __THROWNL throw () 2025-05-07T20:26:27.1107465Z #define __cpp_rtti 199711L 2025-05-07T20:26:27.1107572Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:27.1107670Z #define __PMT(args) args 2025-05-07T20:26:27.1107787Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:27.1107939Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:27.1108065Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:27.1108158Z #define _SIZE_T_DECLARED 2025-05-07T20:26:27.1108263Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:27.1108365Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:27.1108758Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:27.1108867Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:27.1108962Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:27.1109060Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:27.1109209Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:27.1109292Z #define _WCHAR_T_H 2025-05-07T20:26:27.1109385Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:27.1109482Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:27.1109578Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:27.1109680Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:27.1109783Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:27.1109971Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:27.1110089Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:27.1110280Z #define __ELF__ 1 2025-05-07T20:26:27.1110385Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:27.1110491Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:27.1110581Z #define STA_INS 0x0010 2025-05-07T20:26:27.1110683Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:27.1110863Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:27.1110961Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:27.1111057Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:27.1111179Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:27.1111293Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1111395Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:27.1111509Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:27.1111609Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:27.1111775Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:27.1111941Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:27.1112048Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:27.1112378Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:27.1112511Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:27.1112605Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:27.1112702Z #define __FLT_RADIX__ 2 2025-05-07T20:26:27.1112808Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:27.1112975Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:27.1113079Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:27.1113175Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:27.1113284Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:27.1113381Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:27.1113480Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:27.1113605Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:27.1113699Z #define WORD_BIT 32 2025-05-07T20:26:27.1113790Z #define _IO_USER_BUF 1 2025-05-07T20:26:27.1113887Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:27.1114005Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:27.1114120Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:27.1114223Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:27.1114329Z #define __long_double_t long double 2025-05-07T20:26:27.1114427Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:27.1114527Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:27.1114926Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:27.1115010Z #define __k8 1 2025-05-07T20:26:27.1115210Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:27.1115382Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:27.1115501Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:27.1115608Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:27.1115715Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:27.1115827Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:27.1115928Z #define __blksize_t_defined 2025-05-07T20:26:27.1116024Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:27.1116131Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:27.1116247Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:27.1116343Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:27.1116456Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:27.1116554Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:27.1116651Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:27.1116913Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:27.1117253Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:27.1117365Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:27.1117465Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:27.1117636Z #define SEEK_SET 0 2025-05-07T20:26:27.1117744Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:27.1117917Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:27.1118114Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:27.1118230Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:27.1118336Z #define __cudaCDP2GetLastError 2025-05-07T20:26:27.1118431Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:27.1118530Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:27.1118847Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:27.1118947Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:27.1119052Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:27.1119142Z #define __stub_sigreturn 2025-05-07T20:26:27.1119386Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:27.1119485Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:27.1119585Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:27.1119696Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:27.1119787Z #define CLOCK_TAI 11 2025-05-07T20:26:27.1119896Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:27.1119991Z #define __restrict_arr 2025-05-07T20:26:27.1120105Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:27.1120248Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:27.1120779Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:27.1120965Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:27.1121056Z #define __USE_MISC 1 2025-05-07T20:26:27.1121161Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:27.1121261Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:27.1121356Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:27.1121447Z #define __LDBL_DIG__ 18 2025-05-07T20:26:27.1121545Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:27.1121657Z #define __malloc_and_calloc_defined 2025-05-07T20:26:27.1121754Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:27.1121861Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:27.1121953Z #define __x86_64__ 1 2025-05-07T20:26:27.1122037Z #define _SIZE_T_ 2025-05-07T20:26:27.1122916Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:27.1123022Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:27.1123122Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:27.1123247Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:27.1123372Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:27.1123475Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:27.1123592Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:27.1123716Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:27.1123863Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:27.1123963Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:27.1124424Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
2025-05-07T20:26:27.1124560Z [elided] Preprocessor macro dump: the toolchain check printed several thousand #define lines (host libc/libstdc++ macros plus CUDA runtime macros), condensed here for readability. Key values recoverable from the dump: __NVCC__ 1, __CUDACC__ 1, CUDART_VERSION 12060, __CUDA_ARCH_LIST__ 520, _GLIBCXX_RELEASE 11, __GNUC_MINOR__ 4, __GNUC_PATCHLEVEL__ 0, __GLIBC_MINOR__ 17, __linux__ 1, __amd64__ 1.
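[NOTE] The exact command that produced the dump above is not echoed in this log; a plausible way to reproduce such a dump (an assumption, not the setup script's actual invocation) is to ask the compilers for their predefined macros:

    # Dump predefined macros for a trivial CUDA translation unit
    # (-Xcompiler forwards -dM to the host preprocessor; flag handling may vary by nvcc version)
    conda run -n build_binary nvcc -E -Xcompiler -dM -x cu /dev/null | sort
    # Host-compiler-only equivalent, for comparison:
    echo '' | g++ -dM -E -x c++ - | sort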
2025-05-07T20:26:27.1443031Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:29.0315368Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:29.0315757Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:26:29.0316071Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:26:29.0316419Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:26:29.0316752Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:26:29.0960704Z /usr/bin/nvidia-smi
2025-05-07T20:26:29.0966449Z + nvidia-smi
2025-05-07T20:26:29.1145001Z Wed May 7 20:26:29 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                Persistence-M  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf          Pwr:Usage/Cap  |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
    |  0%   27C    P8             16W / 300W  |      0MiB / 23028MiB   |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:29.4050231Z [INSTALL] Successfully installed CUDA 12.6.3
2025-05-07T20:26:29.4103772Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:29.4104328Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:29.4116138Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:29.4116498Z env:
2025-05-07T20:26:29.4116721Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:29.4117014Z   BUILD_ENV: build_binary
2025-05-07T20:26:29.4117258Z   BUILD_TARGET: genai
2025-05-07T20:26:29.4117490Z   BUILD_VARIANT: cuda
2025-05-07T20:26:29.4117722Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:29.4117965Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:29.4118262Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:29.4118587Z ##[endgroup]
2025-05-07T20:26:29.7505546Z ################################################################################
2025-05-07T20:26:29.7505938Z # Install PyTorch (PIP)
2025-05-07T20:26:29.7506181Z #
2025-05-07T20:26:29.7520971Z # [2025-05-07T20:26:29.751Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:29.7521412Z ################################################################################
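[NOTE] A minimal sketch of what this step appears to do, reconstructed from the commands echoed below; this is an assumption, not the actual install_pytorch_pip source in .github/scripts/setup_env.bash:

    # Hypothetical reconstruction: install numpy via conda-forge, then the torch
    # nightly wheel from the CUDA-variant index (cuda/12.6.3 -> cu126).
    install_pytorch_pip () {
      local env_name="$1" channel="$2" variant="$3"   # e.g. build_binary nightly cuda/12.6.3
      local cu_short="cu$(echo "$variant" | cut -d/ -f2 | cut -d. -f1,2 | tr -d '.')"
      conda install -n "$env_name" -c conda-forge --override-channels -y numpy
      conda run -n "$env_name" pip install --pre torch \
        --index-url "https://download.pytorch.org/whl/${channel}/${cu_short}/"
    }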
2025-05-07T20:26:29.7550982Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:30.7518179Z Channels:
2025-05-07T20:26:30.7518436Z  - conda-forge
2025-05-07T20:26:30.7518673Z Platform: linux-64
2025-05-07T20:26:34.0088315Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:34.7266315Z Solving environment: done
2025-05-07T20:26:34.9420026Z ## Package Plan ##
2025-05-07T20:26:34.9420572Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:34.9421116Z   added / updated specs:
2025-05-07T20:26:34.9421447Z     - numpy
2025-05-07T20:26:34.9421814Z The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    libblas-3.9.0              |31_h59b9bed_openblas      16 KB   conda-forge
    libcblas-3.9.0             |31_he106b2a_openblas      16 KB   conda-forge
    libgfortran-15.1.0         |      h69a702a_2          34 KB   conda-forge
    libgfortran5-15.1.0        |      hcea5267_2          1.5 MB  conda-forge
    liblapack-3.9.0            |31_h7ac8fdf_openblas      16 KB   conda-forge
    libopenblas-0.3.29         |pthreads_h94d23a6_0       5.6 MB  conda-forge
    numpy-2.2.5                |  py310hefbff90_0         7.6 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        14.8 MB
The following NEW packages will be INSTALLED:
    libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
    libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
    libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
    libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
    liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
    libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
    numpy         conda-forge/linux-64::numpy-2.2.5-py310hefbff90_0
2025-05-07T20:26:34.9441495Z Downloading and Extracting Packages: ...working... done (interleaved per-package progress bars and ANSI cursor-control residue elided)
2025-05-07T20:26:35.9276257Z Preparing transaction: done
2025-05-07T20:26:36.1283616Z Verifying transaction: done
2025-05-07T20:26:36.2293180Z Executing transaction: done
2025-05-07T20:26:36.4080268Z ################################################################################
2025-05-07T20:26:36.4080672Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:36.4081359Z #
2025-05-07T20:26:36.4096695Z # [2025-05-07T20:26:36.409Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:36.4097165Z ################################################################################
2025-05-07T20:26:36.4112435Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:36.5081644Z [CHECK] Network does not appear to be blocked.
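[NOTE] The [EXEC] [ATTEMPT n/3] prefix suggests a retry wrapper around flaky commands such as this network probe; a minimal sketch under that assumption (the real helper lives in .github/scripts/setup_env.bash and may differ):

    # Hypothetical retry helper: run a command up to 3 times with exponential backoff
    exec_with_retries () {
      local max=3 attempt
      for attempt in $(seq 0 $((max - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep $((2 ** attempt))
      done
      echo "[ERROR] Command failed after ${max} attempts: $*" >&2
      return 1
    }

    # Usage, mirroring the probe above:
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null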
2025-05-07T20:26:36.5082011Z ################################################################################
2025-05-07T20:26:36.5082339Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:36.5082621Z #
2025-05-07T20:26:36.5102523Z # [2025-05-07T20:26:36.509Z] + __prepare_pip_arguments torch nightly cuda/12.6.3
2025-05-07T20:26:36.5102963Z ################################################################################
2025-05-07T20:26:36.5103184Z 
2025-05-07T20:26:36.5126231Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:36.5151357Z [INSTALL] Extracted package variant: cu126
2025-05-07T20:26:36.5167904Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:36.5168464Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:26:36.5176674Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:26:36.5185656Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ...
2025-05-07T20:26:36.5207425Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.1826773Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.1828657Z Collecting torch
2025-05-07T20:27:55.1829632Z   Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:27:55.1830386Z Collecting filelock (from torch)
2025-05-07T20:27:55.1830916Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:27:55.1831860Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from torch) (4.13.2)
2025-05-07T20:27:55.1832609Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:27:55.1833111Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:27:55.1833951Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 45.5 MB/s eta 0:00:00
2025-05-07T20:27:55.1834331Z Collecting networkx (from torch)
2025-05-07T20:27:55.1834833Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB)
2025-05-07T20:27:55.1835480Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 19.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1835831Z Collecting jinja2 (from torch)
2025-05-07T20:27:55.1836315Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:27:55.1836818Z Collecting fsspec (from torch)
2025-05-07T20:27:55.1837315Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:27:55.1837875Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.1838582Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB)
2025-05-07T20:27:55.1839361Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 72.5 MB/s eta 0:00:00
2025-05-07T20:27:55.1839782Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.1840491Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB)
2025-05-07T20:27:55.1841266Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 10.9 MB/s eta 0:00:00
2025-05-07T20:27:55.1842508Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch)
2025-05-07T20:27:55.1843202Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB)
2025-05-07T20:27:55.1843959Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 46.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1844350Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch)
2025-05-07T20:27:55.1845020Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
2025-05-07T20:27:55.1845768Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 34.5 MB/s eta 0:00:00
2025-05-07T20:27:55.1846353Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch)
2025-05-07T20:27:55.1847116Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
2025-05-07T20:27:55.1847954Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 67.4 MB/s eta 0:00:00
2025-05-07T20:27:55.1848330Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch)
2025-05-07T20:27:55.1848990Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB)
2025-05-07T20:27:55.1849746Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 165.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1850118Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch)
2025-05-07T20:27:55.1850786Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB)
2025-05-07T20:27:55.1851556Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 225.8 MB/s eta 0:00:00
2025-05-07T20:27:55.1852077Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch)
2025-05-07T20:27:55.1852767Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB)
2025-05-07T20:27:55.1853558Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 145.7 MB/s eta 0:00:00
2025-05-07T20:27:55.1853946Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch)
2025-05-07T20:27:55.1854639Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB)
2025-05-07T20:27:55.1855398Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 141.7 MB/s eta 0:00:00
2025-05-07T20:27:55.1855788Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:27:55.1856488Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:27:55.1857271Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 162.4 MB/s eta 0:00:00
2025-05-07T20:27:55.1857636Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:27:55.1858393Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:27:55.1859173Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.1859813Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB)
2025-05-07T20:27:55.1860603Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch)
2025-05-07T20:27:55.1861395Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB)
2025-05-07T20:27:55.1862258Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 157.7 MB/s eta 0:00:00
2025-05-07T20:27:55.1862649Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch)
2025-05-07T20:27:55.1863444Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:27:55.1864260Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:27:55.1865219Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:27:55.1866489Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1)
2025-05-07T20:27:55.1867363Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:27:55.1867971Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:27:55.1868703Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 49.9 MB/s eta 0:00:00
2025-05-07T20:27:55.1869073Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:27:55.1869775Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
2025-05-07T20:27:55.1870823Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp310-cp310-manylinux_2_28_x86_64.whl (825.5 MB)
2025-05-07T20:27:55.1871642Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.5/825.5 MB 36.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1872396Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB)
2025-05-07T20:27:55.1873236Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.7 MB/s eta 0:00:00
2025-05-07T20:27:55.1873988Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:27:55.1874844Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 106.3 MB/s eta 0:00:00
2025-05-07T20:27:55.1875625Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB)
2025-05-07T20:27:55.1876504Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 134.0 MB/s eta 0:00:00
2025-05-07T20:27:55.1878356Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:27:55.1880051Z 
2025-05-07T20:27:55.1882132Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126
2025-05-07T20:27:55.1884886Z 
2025-05-07T20:27:57.4076770Z torch 2.8.0.dev20250507+cu126
2025-05-07T20:27:57.4079466Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126)
2025-05-07T20:28:00.8470418Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:04.3067849Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:04.3068383Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:07.6778546Z True
2025-05-07T20:28:07.6778799Z True
2025-05-07T20:28:07.6779230Z 
2025-05-07T20:28:07.7407410Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
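
[NOTE] The variant and ABI checks above reduce to a few lines of Python. The following is a sketch (not part of the harness), assuming it runs inside the build_binary environment:

    # Sketch of the post-install verification (assumes the build_binary env).
    import torch

    print(torch.__version__)                # expect 2.8.0.dev20250507+cu126
    assert torch.__version__.endswith("+cu126"), "wrong CUDA variant"
    print(torch.version.cuda)               # CUDA version torch was built with (12.6)
    print(torch.compiled_with_cxx11_abi())  # the _GLIBCXX_USE_CXX11_ABI probe
    import torch.distributed                # sub-package presence check
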
2025-05-07T20:28:07.7444110Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:07.7444713Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:07.7456314Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:07.7456655Z env:
2025-05-07T20:28:07.7456879Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:07.7457166Z   BUILD_ENV: build_binary
2025-05-07T20:28:07.7457407Z   BUILD_TARGET: genai
2025-05-07T20:28:07.7457632Z   BUILD_VARIANT: cuda
2025-05-07T20:28:07.7457865Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:28:07.7458110Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:07.7458408Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:07.7458732Z ##[endgroup]
2025-05-07T20:28:08.0835763Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:28:08.0837640Z ################################################################################
2025-05-07T20:28:08.0838132Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:28:08.0838490Z #
2025-05-07T20:28:08.0853321Z # [2025-05-07T20:28:08.085Z] + collect_pytorch_env_info build_binary
2025-05-07T20:28:08.0853737Z ################################################################################
2025-05-07T20:28:08.0853946Z 
2025-05-07T20:28:08.0870462Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:08.1771116Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:08.1781285Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:28:08.1781891Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:28:08.1782291Z 
2025-05-07T20:28:08.2674112Z 
2025-05-07T20:28:08.2674661Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:28:08.2696081Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:28:14.2323609Z Collecting environment information...
2025-05-07T20:28:14.2323990Z PyTorch version: 2.8.0.dev20250507+cu126
2025-05-07T20:28:14.2324283Z Is debug build: False
2025-05-07T20:28:14.2324542Z CUDA used to build PyTorch: 12.6
2025-05-07T20:28:14.2324820Z ROCM used to build PyTorch: N/A
2025-05-07T20:28:14.2324993Z 
2025-05-07T20:28:14.2325098Z OS: Amazon Linux 2023.6.20250317 (x86_64)
2025-05-07T20:28:14.2325418Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:28:14.2325734Z Clang version: Could not collect
2025-05-07T20:28:14.2326002Z CMake version: Could not collect
2025-05-07T20:28:14.2326269Z Libc version: glibc-2.34
2025-05-07T20:28:14.2326432Z 
2025-05-07T20:28:14.2326735Z Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
2025-05-07T20:28:14.2327339Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34
2025-05-07T20:28:14.2327735Z Is CUDA available: True
2025-05-07T20:28:14.2327985Z CUDA runtime version: 12.6.85
2025-05-07T20:28:14.2328270Z CUDA_MODULE_LOADING set to: LAZY
2025-05-07T20:28:14.2328731Z GPU models and configuration: GPU 0: NVIDIA A10G
2025-05-07T20:28:14.2329455Z Nvidia driver version: 570.133.07
2025-05-07T20:28:14.2330035Z cuDNN version: Could not collect
2025-05-07T20:28:14.2330381Z HIP runtime version: N/A
2025-05-07T20:28:14.2330768Z MIOpen runtime version: N/A
2025-05-07T20:28:14.2331253Z Is XNNPACK available: True
2025-05-07T20:28:14.2331464Z 
2025-05-07T20:28:14.2331565Z CPU:
2025-05-07T20:28:14.2331898Z Architecture: x86_64
2025-05-07T20:28:14.2332408Z CPU op-mode(s): 32-bit, 64-bit
2025-05-07T20:28:14.2332907Z Address sizes: 48 bits physical, 48 bits virtual
2025-05-07T20:28:14.2333402Z Byte Order: Little Endian
2025-05-07T20:28:14.2333798Z CPU(s): 16
2025-05-07T20:28:14.2343517Z On-line CPU(s) list: 0-15
2025-05-07T20:28:14.2344129Z Vendor ID: AuthenticAMD
2025-05-07T20:28:14.2344489Z Model name: AMD EPYC 7R32
2025-05-07T20:28:14.2344811Z CPU family: 23
2025-05-07T20:28:14.2345103Z Model: 49
2025-05-07T20:28:14.2345392Z Thread(s) per core: 2
2025-05-07T20:28:14.2345689Z Core(s) per socket: 8
2025-05-07T20:28:14.2345965Z Socket(s): 1
2025-05-07T20:28:14.2346251Z Stepping: 0
2025-05-07T20:28:14.2346557Z BogoMIPS: 5599.62
2025-05-07T20:28:14.2348587Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:28:14.2350607Z Hypervisor vendor: KVM
2025-05-07T20:28:14.2350926Z Virtualization type: full
2025-05-07T20:28:14.2351262Z L1d cache: 256 KiB (8 instances)
2025-05-07T20:28:14.2351633Z L1i cache: 256 KiB (8 instances)
2025-05-07T20:28:14.2351997Z L2 cache: 4 MiB (8 instances)
2025-05-07T20:28:14.2352345Z L3 cache: 32 MiB (2 instances)
2025-05-07T20:28:14.2352669Z NUMA node(s): 1
2025-05-07T20:28:14.2352965Z NUMA node0 CPU(s): 0-15
2025-05-07T20:28:14.2353297Z Vulnerability Gather data sampling: Not affected
2025-05-07T20:28:14.2353679Z Vulnerability Itlb multihit: Not affected
2025-05-07T20:28:14.2354037Z Vulnerability L1tf: Not affected
2025-05-07T20:28:14.2354388Z Vulnerability Mds: Not affected
2025-05-07T20:28:14.2354744Z Vulnerability Meltdown: Not affected
2025-05-07T20:28:14.2355101Z Vulnerability Mmio stale data: Not affected
2025-05-07T20:28:14.2355467Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:28:14.2355998Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:28:14.2356576Z Vulnerability Spec rstack overflow: Mitigation; safe RET
2025-05-07T20:28:14.2357111Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:28:14.2357798Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:28:14.2358647Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:28:14.2359316Z Vulnerability Srbds: Not affected
2025-05-07T20:28:14.2359679Z Vulnerability Tsx async abort: Not affected
2025-05-07T20:28:14.2360004Z 
2025-05-07T20:28:14.2360108Z Versions of relevant libraries:
2025-05-07T20:28:14.2360378Z [pip3] numpy==2.2.5
2025-05-07T20:28:14.2360626Z [pip3] nvidia-cublas-cu12==12.6.4.1
2025-05-07T20:28:14.2360933Z [pip3] nvidia-cuda-cupti-cu12==12.6.80
2025-05-07T20:28:14.2361238Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77
2025-05-07T20:28:14.2361556Z [pip3] nvidia-cuda-runtime-cu12==12.6.77
2025-05-07T20:28:14.2361869Z [pip3] nvidia-cudnn-cu12==9.5.1.17
2025-05-07T20:28:14.2362148Z [pip3] nvidia-cufft-cu12==11.3.0.4
2025-05-07T20:28:14.2362438Z [pip3] nvidia-curand-cu12==10.3.7.77
2025-05-07T20:28:14.2362733Z [pip3] nvidia-cusolver-cu12==11.7.1.2
2025-05-07T20:28:14.2363030Z [pip3] nvidia-cusparse-cu12==12.5.4.2
2025-05-07T20:28:14.2363441Z [pip3] nvidia-cusparselt-cu12==0.6.3
2025-05-07T20:28:14.2363738Z [pip3] nvidia-nccl-cu12==2.26.2
2025-05-07T20:28:14.2364015Z [pip3] nvidia-nvjitlink-cu12==12.6.85
2025-05-07T20:28:14.2364312Z [pip3] nvidia-nvtx-cu12==12.6.77
2025-05-07T20:28:14.2364606Z [pip3] pytorch-triton==3.3.0+git96316ce5
2025-05-07T20:28:14.2364906Z [pip3] torch==2.8.0.dev20250507+cu126
2025-05-07T20:28:14.2365266Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2365752Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2366263Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge
2025-05-07T20:28:14.2366772Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2367302Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge
2025-05-07T20:28:14.2367827Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge
2025-05-07T20:28:14.2368311Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2368776Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge
2025-05-07T20:28:14.2369266Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge
2025-05-07T20:28:14.2369758Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge
2025-05-07T20:28:14.2370225Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2370683Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge
2025-05-07T20:28:14.2371138Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2371590Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2372054Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2372533Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge
2025-05-07T20:28:14.2372994Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge
2025-05-07T20:28:14.2373499Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge
2025-05-07T20:28:14.2373968Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2374423Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge
2025-05-07T20:28:14.2374878Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2375332Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge
2025-05-07T20:28:14.2375801Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge
2025-05-07T20:28:14.2376279Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge
2025-05-07T20:28:14.2376760Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2377230Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge
2025-05-07T20:28:14.2377711Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge
2025-05-07T20:28:14.2378289Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge
2025-05-07T20:28:14.2378741Z [conda] numpy 2.2.5 py310hefbff90_0 conda-forge
2025-05-07T20:28:14.2379200Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
2025-05-07T20:28:14.2379694Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
2025-05-07T20:28:14.2380310Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
2025-05-07T20:28:14.2380804Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
2025-05-07T20:28:14.2381288Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
2025-05-07T20:28:14.2381851Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
2025-05-07T20:28:14.2382320Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
2025-05-07T20:28:14.2382806Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
2025-05-07T20:28:14.2383348Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
2025-05-07T20:28:14.2383841Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
2025-05-07T20:28:14.2384313Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi
2025-05-07T20:28:14.2384787Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
2025-05-07T20:28:14.2385260Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
2025-05-07T20:28:14.2385728Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi
2025-05-07T20:28:14.2386186Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi
2025-05-07T20:28:14.2386459Z 
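
[NOTE] The same report can be produced without downloading the script, since the installed torch wheel bundles a copy of collect_env; a minimal sketch:

    # Sketch: use the collect_env module bundled with the installed torch
    # (the harness instead fetches the latest script from pytorch/pytorch main).
    from torch.utils.collect_env import main
    main()  # prints the environment report to stdout
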
2025-05-07T20:28:14.3073678Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV
2025-05-07T20:28:14.3074355Z . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV
2025-05-07T20:28:14.3086962Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:14.3087305Z env:
2025-05-07T20:28:14.3087529Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:14.3087829Z   BUILD_ENV: build_binary
2025-05-07T20:28:14.3088063Z   BUILD_TARGET: genai
2025-05-07T20:28:14.3088295Z   BUILD_VARIANT: cuda
2025-05-07T20:28:14.3088543Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:28:14.3088796Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:14.3089101Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:14.3089434Z ##[endgroup]
2025-05-07T20:28:14.6485653Z ################################################################################
2025-05-07T20:28:14.6486182Z # Prepare FBGEMM-GPU Build
2025-05-07T20:28:14.6486501Z #
2025-05-07T20:28:14.6502076Z # [2025-05-07T20:28:14.649Z] + prepare_fbgemm_gpu_build build_binary
2025-05-07T20:28:14.6502640Z ################################################################################
2025-05-07T20:28:14.6502952Z 
2025-05-07T20:28:14.6517302Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:14.7456483Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:14.7479488Z [BUILD] Running git submodules update ...
2025-05-07T20:28:14.7500028Z [EXEC] [ATTEMPT 0/3] + git submodule sync
2025-05-07T20:28:14.7866440Z Synchronizing submodule url for '../external/asmjit'
2025-05-07T20:28:14.7867091Z Synchronizing submodule url for '../external/composable_kernel'
2025-05-07T20:28:14.7867658Z Synchronizing submodule url for '../external/cpuinfo'
2025-05-07T20:28:14.7868053Z Synchronizing submodule url for '../external/cutlass'
2025-05-07T20:28:14.7868475Z Synchronizing submodule url for '../external/googletest'
2025-05-07T20:28:14.7868938Z Synchronizing submodule url for '../external/hipify_torch'
2025-05-07T20:28:14.7869342Z Synchronizing submodule url for '../external/json'
2025-05-07T20:28:14.7902364Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive
2025-05-07T20:28:14.8458306Z [BUILD] Installing other build dependencies ...
2025-05-07T20:28:14.8480578Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt
2025-05-07T20:28:17.2596065Z Collecting backports.tarfile (from -r requirements.txt (line 13))
2025-05-07T20:28:17.2775086Z   Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB)
2025-05-07T20:28:17.3895540Z Collecting build (from -r requirements.txt (line 14))
2025-05-07T20:28:17.3965826Z   Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
2025-05-07T20:28:17.6359381Z Collecting cmake (from -r requirements.txt (line 15))
2025-05-07T20:28:17.6393607Z   Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB)
2025-05-07T20:28:17.7558592Z Collecting click (from -r requirements.txt (line 16))
2025-05-07T20:28:17.7784369Z   Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
2025-05-07T20:28:18.1262447Z Collecting hypothesis (from -r requirements.txt (line 17))
2025-05-07T20:28:18.1293733Z   Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB)
2025-05-07T20:28:18.1884194Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (3.1.4)
2025-05-07T20:28:18.1888138Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (1.3.0)
2025-05-07T20:28:18.2621380Z Collecting ninja (from -r requirements.txt (line 20))
2025-05-07T20:28:18.2650324Z   Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB)
2025-05-07T20:28:18.3144654Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 21)) (2.2.5)
2025-05-07T20:28:18.3807988Z Collecting pyre-extensions (from -r requirements.txt (line 22))
2025-05-07T20:28:18.3837477Z   Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB)
2025-05-07T20:28:18.5165771Z Collecting pyyaml (from -r requirements.txt (line 23))
2025-05-07T20:28:18.5199085Z   Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
2025-05-07T20:28:18.6320789Z Collecting scikit-build (from -r requirements.txt (line 24))
2025-05-07T20:28:18.6374154Z   Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB)
2025-05-07T20:28:18.7008615Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 25)) (78.1.1)
2025-05-07T20:28:18.7689751Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26))
2025-05-07T20:28:18.7719084Z   Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB)
2025-05-07T20:28:18.8769179Z Collecting tabulate (from -r requirements.txt (line 27))
2025-05-07T20:28:18.8798981Z   Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
2025-05-07T20:28:18.9907956Z Collecting patchelf (from -r requirements.txt (line 28))
2025-05-07T20:28:18.9937937Z   Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB)
2025-05-07T20:28:19.1075374Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14))
2025-05-07T20:28:19.1106815Z   Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB)
2025-05-07T20:28:19.2116168Z Collecting pyproject_hooks (from build->-r requirements.txt (line 14))
2025-05-07T20:28:19.2165605Z   Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB)
2025-05-07T20:28:19.3305232Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14))
2025-05-07T20:28:19.3337869Z   Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB)
2025-05-07T20:28:19.4508021Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17))
2025-05-07T20:28:19.4544239Z   Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
2025-05-07T20:28:19.5886356Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17))
2025-05-07T20:28:19.5914845Z   Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB)
2025-05-07T20:28:19.6845691Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17))
2025-05-07T20:28:19.6878713Z   Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
2025-05-07T20:28:19.7411464Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5)
2025-05-07T20:28:19.7935780Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22))
2025-05-07T20:28:19.7964666Z   Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
2025-05-07T20:28:19.8472049Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2)
2025-05-07T20:28:19.9008994Z Collecting distro (from scikit-build->-r requirements.txt (line 24))
2025-05-07T20:28:19.9037918Z   Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
2025-05-07T20:28:19.9536242Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1)
2025-05-07T20:28:20.0214596Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22))
2025-05-07T20:28:20.0247694Z   Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)
2025-05-07T20:28:20.0786255Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB)
2025-05-07T20:28:20.1393356Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB)
2025-05-07T20:28:20.1978222Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB)
2025-05-07T20:28:20.7237050Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 53.0 MB/s eta 0:00:00
2025-05-07T20:28:20.7269768Z Downloading click-8.1.8-py3-none-any.whl (98 kB)
2025-05-07T20:28:20.7816264Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB)
2025-05-07T20:28:20.8447832Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
2025-05-07T20:28:20.8986631Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB)
2025-05-07T20:28:20.9656858Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB)
2025-05-07T20:28:21.0259712Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB)
2025-05-07T20:28:21.0855121Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 8.7 MB/s eta 0:00:00
2025-05-07T20:28:21.0931798Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB)
2025-05-07T20:28:21.1402300Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB)
2025-05-07T20:28:21.1883009Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
2025-05-07T20:28:21.2405593Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB)
2025-05-07T20:28:21.2940095Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB)
2025-05-07T20:28:21.3425233Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB)
2025-05-07T20:28:21.4034092Z Downloading packaging-25.0-py3-none-any.whl (66 kB)
2025-05-07T20:28:21.4563298Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB)
2025-05-07T20:28:21.5039806Z Downloading distro-1.9.0-py3-none-any.whl (20 kB)
2025-05-07T20:28:21.5530277Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
2025-05-07T20:28:21.6026437Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
2025-05-07T20:28:21.6549948Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB)
2025-05-07T20:28:21.8815883Z Installing collected packages: sortedcontainers, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions
2025-05-07T20:28:24.2732469Z 
2025-05-07T20:28:24.2804058Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0
2025-05-07T20:28:24.4702333Z ################################################################################
2025-05-07T20:28:24.4702739Z # Install PyTorch (PyTorch PIP)
2025-05-07T20:28:24.4703004Z #
2025-05-07T20:28:24.4719501Z # [2025-05-07T20:28:24.471Z] + install_triton_pip build_binary
2025-05-07T20:28:24.4719929Z ################################################################################
2025-05-07T20:28:24.4720164Z 
2025-05-07T20:28:24.4720385Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ...
2025-05-07T20:28:24.4720936Z ################################################################################
2025-05-07T20:28:24.4721291Z # Install Package From PyTorch PIP: pytorch-triton
2025-05-07T20:28:24.4721596Z #
2025-05-07T20:28:24.4736112Z # [2025-05-07T20:28:24.473Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8
2025-05-07T20:28:24.4736639Z ################################################################################
2025-05-07T20:28:24.4736847Z 
2025-05-07T20:28:24.4751974Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:24.5676916Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:24.5677594Z ################################################################################
2025-05-07T20:28:24.5677951Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:28:24.5679973Z #
2025-05-07T20:28:24.5696712Z # [2025-05-07T20:28:24.569Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8
2025-05-07T20:28:24.5697234Z ################################################################################
2025-05-07T20:28:24.5697453Z 
2025-05-07T20:28:24.5745161Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8)
2025-05-07T20:28:24.5761766Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:28:24.5762275Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/
2025-05-07T20:28:24.5771354Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8
2025-05-07T20:28:24.5781451Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ...
2025-05-07T20:28:24.5802993Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/
2025-05-07T20:28:32.3834495Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
2025-05-07T20:28:32.3835817Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible.
2025-05-07T20:28:32.3836538Z 
2025-05-07T20:28:32.3836756Z Looking in indexes: https://download.pytorch.org/whl/nightly/
2025-05-07T20:28:32.3837175Z Collecting pytorch-triton==3.2.0+git4b3bb1f8
2025-05-07T20:28:32.3837981Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB)
2025-05-07T20:28:32.3839281Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB)
2025-05-07T20:28:32.3840690Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 53.4 MB/s eta 0:00:00
2025-05-07T20:28:32.3841079Z Installing collected packages: pytorch-triton
2025-05-07T20:28:32.3841416Z   Attempting uninstall: pytorch-triton
2025-05-07T20:28:32.3841803Z     Found existing installation: pytorch-triton 3.3.0+git96316ce5
2025-05-07T20:28:32.3842222Z     Uninstalling pytorch-triton-3.3.0+git96316ce5:
2025-05-07T20:28:32.3842638Z       Successfully uninstalled pytorch-triton-3.3.0+git96316ce5
2025-05-07T20:28:32.3843069Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8
2025-05-07T20:28:32.3843331Z 
2025-05-07T20:28:34.5798963Z [CHECK] Python (sub-)package 'triton' found ...
2025-05-07T20:28:34.5802237Z [CHECK] Printing out the pytorch-triton version ...
2025-05-07T20:28:36.7275523Z ################################################################################
2025-05-07T20:28:36.7276141Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0
2025-05-07T20:28:36.7276543Z ################################################################################
2025-05-07T20:28:36.7276776Z 
2025-05-07T20:28:38.7717890Z [CHECK] Python (sub-)package 'numpy' found ...
2025-05-07T20:28:40.8825612Z [CHECK] Python (sub-)package 'skbuild' found ...
2025-05-07T20:28:40.8829416Z [BUILD] Successfully ran git submodules update
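
[NOTE] The resolver warning above is expected with a pinned pytorch-triton: the harness requests nightly/3.2.0+git4b3bb1f8 while this torch nightly declares pytorch-triton==3.3.0+git96316ce5 in its metadata, so pip installs the pin and reports the conflict. A sketch for surfacing such a mismatch programmatically:

    # Sketch: compare the installed pytorch-triton against what torch declares.
    from importlib.metadata import requires, version

    installed = version("pytorch-triton")
    declared = [r for r in (requires("torch") or []) if r.startswith("pytorch-triton")]
    print(installed)  # 3.2.0+git4b3bb1f8 after the pin above
    print(declared)   # ['pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" ...']
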
2025-05-07T20:28:40.8884449Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl
2025-05-07T20:28:40.8884934Z . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl
2025-05-07T20:28:40.8897103Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:40.8897444Z env:
2025-05-07T20:28:40.8897672Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:40.8898147Z   BUILD_ENV: build_binary
2025-05-07T20:28:40.8898396Z   BUILD_TARGET: genai
2025-05-07T20:28:40.8898624Z   BUILD_VARIANT: cuda
2025-05-07T20:28:40.8898857Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:28:40.8899107Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:40.8899407Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:40.8899742Z ##[endgroup]
2025-05-07T20:28:41.2259742Z ################################################################################
2025-05-07T20:28:41.2260235Z # Install FBGEMM-GPU from Wheel
2025-05-07T20:28:41.2260492Z #
2025-05-07T20:28:41.2277282Z # [2025-05-07T20:28:41.227Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2278265Z ################################################################################
2025-05-07T20:28:41.2278585Z 
2025-05-07T20:28:41.2279103Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2280134Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2280472Z 
2025-05-07T20:28:41.2396661Z 4d1609ed0721ee216ce1a19f96ff799eee4aae34  fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2399564Z 
2025-05-07T20:28:41.2400091Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2400596Z 
2025-05-07T20:28:41.2528644Z ad43f456d1673a9cf1f77f0929f0cfd284ec9b8069b0a67a8cf77246792fe8cf  fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2531349Z 
2025-05-07T20:28:41.2531801Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2532155Z 
2025-05-07T20:28:41.2756489Z c264a66986d7747c3b5c78c4d7455217  fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:41.2759178Z 
2025-05-07T20:28:41.2768382Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl ...
2025-05-07T20:28:41.2790115Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:44.0016445Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl
2025-05-07T20:28:44.0017807Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5)
2025-05-07T20:28:44.0018628Z Installing collected packages: fbgemm-gpu-genai-nightly
2025-05-07T20:28:44.0019061Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7
2025-05-07T20:28:44.0019337Z 
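
[NOTE] The digests printed above can be re-checked independently before installing the wheel; a minimal sketch using hashlib:

    # Sketch: recompute the wheel digests printed above.
    import hashlib

    path = "fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl"
    with open(path, "rb") as f:
        data = f.read()
    print(hashlib.sha1(data).hexdigest())    # expect 4d1609ed0721ee216ce1a19f96ff799eee4aae34
    print(hashlib.sha256(data).hexdigest())  # expect ad43f456d1673a9cf1f77f0929f0cfd284ec9b8069b0a67a8cf77246792fe8cf
    print(hashlib.md5(data).hexdigest())     # expect c264a66986d7747c3b5c78c4d7455217
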
2025-05-07T20:28:50.9312878Z ################################################################################
2025-05-07T20:28:50.9313290Z [CHECK] !!!! INFO !!!!
2025-05-07T20:28:50.9313689Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:50.9314113Z [CHECK] CUDA version reported by PyTorch is: 12.6
2025-05-07T20:28:50.9314430Z [CHECK]
2025-05-07T20:28:50.9314749Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:28:50.9315241Z [CHECK] package channel; the package may be broken at runtime!!!
2025-05-07T20:28:50.9315667Z ################################################################################
2025-05-07T20:28:50.9315876Z 
2025-05-07T20:28:50.9316001Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:28:54.8472973Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:28:58.8027644Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:02.7558604Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:02.7562220Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:29:14.5305629Z ################################################################################
2025-05-07T20:29:14.5307756Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:29:14.5308198Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:29:14.5308729Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:29:14.5309082Z ################################################################################
2025-05-07T20:29:14.5309326Z 
2025-05-07T20:29:22.3655406Z ################################################################################
2025-05-07T20:29:22.3655855Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:29:22.3657255Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:29:22.3659013Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:29:22.3659537Z ################################################################################
2025-05-07T20:29:22.3659754Z 
2025-05-07T20:29:22.3660036Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:29:26.2904147Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:29:30.2002063Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:29:34.2596161Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:29:38.1899070Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:29:38.1902463Z [INSTALL] Check for operator registrations ...
2025-05-07T20:29:42.0506369Z fbgemm.nccl_init
2025-05-07T20:29:42.0506627Z 
2025-05-07T20:29:42.1128861Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init
2025-05-07T20:29:45.9703944Z fbgemm.gqa_attn_splitk
2025-05-07T20:29:45.9704165Z 
2025-05-07T20:29:46.0334814Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk
2025-05-07T20:29:49.8871819Z fbgemm.rope_qkv_decoding
2025-05-07T20:29:49.8872113Z 
2025-05-07T20:29:49.9494921Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding
2025-05-07T20:29:49.9496215Z [INSTALL] FBGEMM-GPU installation through wheel completed ...
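
[NOTE] The registration probes above amount to importing fbgemm_gpu (which loads the native libraries as a side effect) and then resolving each operator on torch.ops.fbgemm; a sketch:

    # Sketch of the operator-registration probe.
    import torch
    import fbgemm_gpu  # noqa: F401  # side effect: loads and registers the ops

    for name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        # Attribute resolution raises if the operator is not registered.
        print(name, getattr(torch.ops.fbgemm, name))
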
2025-05-07T20:29:49.9536908Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:29:49.9537365Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:29:49.9553990Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:29:49.9554339Z env:
2025-05-07T20:29:49.9554563Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:29:49.9554859Z   BUILD_ENV: build_binary
2025-05-07T20:29:49.9555106Z   BUILD_TARGET: genai
2025-05-07T20:29:49.9555337Z   BUILD_VARIANT: cuda
2025-05-07T20:29:49.9555581Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:29:49.9555832Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:29:49.9556130Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:29:49.9556459Z ##[endgroup]
2025-05-07T20:29:50.2941325Z ################################################################################
2025-05-07T20:29:50.2941724Z # Test All FBGEMM-GPU Modules
2025-05-07T20:29:50.2941984Z #
2025-05-07T20:29:50.2956970Z # [2025-05-07T20:29:50.295Z] + test_all_fbgemm_gpu_modules build_binary
2025-05-07T20:29:50.2957376Z ################################################################################
2025-05-07T20:29:50.2957590Z 
2025-05-07T20:29:58.1654001Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda)
2025-05-07T20:29:58.1654579Z [TEST] Will be running tests specific to this target and variant ...
2025-05-07T20:29:58.1654971Z [TEST] Determined the test directories:
2025-05-07T20:29:58.1655283Z   fbgemm_gpu/experimental/gen_ai/test
2025-05-07T20:29:58.1655587Z   fbgemm_gpu/experimental/example/test
2025-05-07T20:29:58.1655878Z   fbgemm_gpu/experimental/gemm/test
2025-05-07T20:29:58.1656068Z 
2025-05-07T20:29:58.1660941Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ...
2025-05-07T20:29:58.1670062Z [TEST] Set environment variables for CUDA testing ...
2025-05-07T20:29:58.1670525Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES
2025-05-07T20:29:58.1670804Z 
2025-05-07T20:29:58.5888056Z 
2025-05-07T20:29:58.5888360Z [TEST] Installing PyTest ...
2025-05-07T20:29:58.5913168Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:29:59.6898073Z Channels:
2025-05-07T20:29:59.6898334Z  - conda-forge
2025-05-07T20:29:59.6898566Z Platform: linux-64
2025-05-07T20:30:02.9768227Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:04.1237616Z Solving environment: done
2025-05-07T20:30:04.3486179Z 
2025-05-07T20:30:04.3486794Z ## Package Plan ##
2025-05-07T20:30:04.3486975Z 
2025-05-07T20:30:04.3487183Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:04.3487486Z 
2025-05-07T20:30:04.3487584Z   added / updated specs:
2025-05-07T20:30:04.3487832Z     - expecttest
2025-05-07T20:30:04.3488087Z     - pytest
2025-05-07T20:30:04.3488210Z 
2025-05-07T20:30:04.3488336Z The following packages will be downloaded:
2025-05-07T20:30:04.3488565Z 
2025-05-07T20:30:04.3488680Z     package                    |            build
2025-05-07T20:30:04.3489000Z     ---------------------------|-----------------
2025-05-07T20:30:04.3489366Z     colorama-0.4.6             |   pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:30:04.3490075Z     exceptiongroup-1.2.2       |   pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:30:04.3490565Z     expecttest-0.3.0           |   pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:30:04.3491002Z     iniconfig-2.0.0            |   pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:30:04.3491426Z     packaging-25.0             |   pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:30:04.3491847Z     pluggy-1.5.0               |   pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:30:04.3492253Z     pytest-8.3.5               |   pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:30:04.3493202Z     tomli-2.2.1                |   pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:30:04.3493591Z     ------------------------------------------------------------
2025-05-07T20:30:04.3493935Z                                            Total:         428 KB
2025-05-07T20:30:04.3494140Z 
2025-05-07T20:30:04.3494277Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:04.3494491Z 
2025-05-07T20:30:04.3494693Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:04.3495198Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:04.3495719Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:04.3496189Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:04.3496645Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:04.3497103Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:04.3497532Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:04.3497960Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:04.3498214Z 
2025-05-07T20:30:04.3498367Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:30:04.7973328Z Preparing transaction: done
2025-05-07T20:30:04.8980265Z Verifying transaction: done
2025-05-07T20:30:06.7007509Z Executing transaction: done
2025-05-07T20:30:06.8297146Z [TEST] Checking imports ...
2025-05-07T20:30:10.7292616Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:10.7307265Z [TEST] Setting feature flags ...
2025-05-07T20:30:10.7307890Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1
2025-05-07T20:30:10.7308288Z 
2025-05-07T20:30:11.1571652Z 
2025-05-07T20:30:11.1572160Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning
2025-05-07T20:30:11.1583728Z ################################################################################
2025-05-07T20:30:11.1584142Z # Run FBGEMM-GPU Tests:
2025-05-07T20:30:11.1584398Z #
2025-05-07T20:30:11.1591483Z # [2025-05-07T20:30:11.158Z] + __run_fbgemm_gpu_tests_in_directory build_binary
2025-05-07T20:30:11.1591914Z ################################################################################
2025-05-07T20:30:11.1592139Z 
2025-05-07T20:30:11.1598592Z [TEST] Enumerating ALL test files ...
2025-05-07T20:30:11.1629834Z ./attention/gqa_test.py
2025-05-07T20:30:11.1630116Z ./coalesce/coalesce_test.py
2025-05-07T20:30:11.1630383Z ./comm/multi_gpu_car_test.py
2025-05-07T20:30:11.1630664Z ./gather_scatter/gather_scatter_test.py
2025-05-07T20:30:11.1630967Z ./kv_cache/kv_cache_test.py
2025-05-07T20:30:11.1631219Z ./moe/activation_test.py
2025-05-07T20:30:11.1631475Z ./moe/gather_scatter_test.py
2025-05-07T20:30:11.1631731Z ./moe/layers_test.py
2025-05-07T20:30:11.1631967Z ./moe/shuffling_test.py
2025-05-07T20:30:11.1632209Z ./quantize/quantize_test.py
2025-05-07T20:30:11.1632389Z 
2025-05-07T20:30:11.1632506Z [TEST] Enumerating IGNORED test files ...
2025-05-07T20:30:11.1632714Z 
2025-05-07T20:30:11.1650130Z ################################################################################
2025-05-07T20:30:11.1665362Z # [2025-05-07T20:30:11.166Z] Run Python Test Suite:
2025-05-07T20:30:11.1665697Z #   ./attention/gqa_test.py
2025-05-07T20:30:11.1665979Z ################################################################################
2025-05-07T20:30:11.1689677Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py
2025-05-07T20:30:11.1690512Z 
2025-05-07T20:30:13.6971393Z ============================= test session starts ==============================
2025-05-07T20:30:13.6972057Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:13.6972590Z cachedir: .pytest_cache
2025-05-07T20:30:13.6973708Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:13.6974432Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:13.6974841Z plugins: hypothesis-6.131.14
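
[NOTE] The hypothesis profile 'ci' reported in the session header above corresponds to a registered Hypothesis settings profile, roughly as follows (a sketch; FBGEMM's conftest may differ in detail):

    # Sketch: a Hypothesis settings profile matching the session header above.
    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,
        deadline=None,
        print_blob=True,
        derandomize=True,
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")
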
collected 2 items 2025-05-07T20:30:15.2174863Z 2025-05-07T20:30:52.0340631Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:30:52.0343376Z self=, 2025-05-07T20:30:52.0343800Z int4_kv=False, 2025-05-07T20:30:52.0344114Z num_groups=1, 2025-05-07T20:30:52.0344381Z B=1, 2025-05-07T20:30:52.0344619Z MAX_T=4, 2025-05-07T20:30:52.0344856Z N_H_L=1, 2025-05-07T20:30:52.0345105Z ) 2025-05-07T20:30:52.0345355Z Trying example: test_gqa( 2025-05-07T20:30:52.0345708Z self=, 2025-05-07T20:30:52.0346133Z int4_kv=True, 2025-05-07T20:30:52.0346400Z num_groups=1, 2025-05-07T20:30:52.0346649Z B=1, 2025-05-07T20:30:52.0346882Z MAX_T=4, 2025-05-07T20:30:52.0347139Z N_H_L=1, 2025-05-07T20:30:52.0347370Z ) 2025-05-07T20:30:52.0347617Z Trying example: test_gqa( 2025-05-07T20:30:52.0347979Z self=, 2025-05-07T20:30:52.0348352Z int4_kv=True, 2025-05-07T20:30:52.0348612Z num_groups=4, 2025-05-07T20:30:52.0348868Z B=23, 2025-05-07T20:30:52.0349110Z MAX_T=33, 2025-05-07T20:30:52.0349368Z N_H_L=68, 2025-05-07T20:30:52.0349625Z ) 2025-05-07T20:30:52.0349916Z Trying example: test_gqa( 2025-05-07T20:30:52.0350268Z self=, 2025-05-07T20:30:52.0350639Z int4_kv=True, 2025-05-07T20:30:52.0350894Z num_groups=4, 2025-05-07T20:30:52.0351145Z B=77, 2025-05-07T20:30:52.0351363Z MAX_T=4, 2025-05-07T20:30:52.0351611Z N_H_L=1, 2025-05-07T20:30:52.0351846Z ) 2025-05-07T20:30:52.0352073Z Trying example: test_gqa( 2025-05-07T20:30:52.0352431Z self=, 2025-05-07T20:30:52.0352812Z int4_kv=True, 2025-05-07T20:30:52.0353067Z num_groups=4, 2025-05-07T20:30:52.0353322Z B=77, 2025-05-07T20:30:52.0353547Z MAX_T=52, 2025-05-07T20:30:52.0353776Z N_H_L=67, 2025-05-07T20:30:52.0354012Z ) 2025-05-07T20:30:52.0354322Z Trying example: test_gqa( 2025-05-07T20:30:52.0354676Z self=, 2025-05-07T20:30:52.0355046Z int4_kv=False, 2025-05-07T20:30:52.0355307Z num_groups=4, 2025-05-07T20:30:52.0355555Z B=57, 2025-05-07T20:30:52.0355773Z MAX_T=45, 2025-05-07T20:30:52.0356015Z N_H_L=120, 2025-05-07T20:30:52.0356254Z ) 2025-05-07T20:30:52.0356483Z Trying example: test_gqa( 2025-05-07T20:30:52.0356833Z self=, 2025-05-07T20:30:52.0357211Z int4_kv=True, 2025-05-07T20:30:52.0357458Z num_groups=4, 2025-05-07T20:30:52.0357704Z B=52, 2025-05-07T20:30:52.0357939Z MAX_T=42, 2025-05-07T20:30:52.0358166Z N_H_L=53, 2025-05-07T20:30:52.0358397Z ) 2025-05-07T20:30:52.0358631Z Trying example: test_gqa( 2025-05-07T20:30:52.0358984Z self=, 2025-05-07T20:30:52.0359363Z int4_kv=True, 2025-05-07T20:30:52.0359617Z num_groups=1, 2025-05-07T20:30:52.0359857Z B=77, 2025-05-07T20:30:52.0360086Z MAX_T=95, 2025-05-07T20:30:52.0360322Z N_H_L=53, 2025-05-07T20:30:52.0360548Z ) 2025-05-07T20:30:52.0360786Z Trying example: test_gqa( 2025-05-07T20:30:52.0361137Z self=, 2025-05-07T20:30:52.0361518Z int4_kv=True, 2025-05-07T20:30:52.0361764Z num_groups=4, 2025-05-07T20:30:52.0362016Z B=113, 2025-05-07T20:30:52.0362244Z MAX_T=48, 2025-05-07T20:30:52.0362476Z N_H_L=96, 2025-05-07T20:30:52.0362709Z ) 2025-05-07T20:30:52.0362940Z Trying example: test_gqa( 2025-05-07T20:30:52.0363282Z self=, 2025-05-07T20:30:52.0364083Z int4_kv=False, 2025-05-07T20:30:52.0364345Z num_groups=1, 2025-05-07T20:30:52.0364586Z B=51, 2025-05-07T20:30:52.0365001Z MAX_T=61, 2025-05-07T20:30:52.0365249Z N_H_L=69, 2025-05-07T20:30:52.0365473Z ) 2025-05-07T20:30:52.0365708Z Trying example: test_gqa( 2025-05-07T20:30:52.0366056Z self=, 2025-05-07T20:30:52.0366427Z int4_kv=False, 2025-05-07T20:30:52.0366686Z num_groups=4, 2025-05-07T20:30:52.0366933Z B=17, 2025-05-07T20:30:52.0367155Z MAX_T=113, 
2025-05-07T20:30:52.0367399Z N_H_L=65, 2025-05-07T20:30:52.0367629Z ) 2025-05-07T20:30:52.0367856Z Trying example: test_gqa( 2025-05-07T20:30:52.0368205Z self=, 2025-05-07T20:30:52.0368582Z int4_kv=False, 2025-05-07T20:30:52.0368831Z num_groups=4, 2025-05-07T20:30:52.0369082Z B=17, 2025-05-07T20:30:52.0369310Z MAX_T=65, 2025-05-07T20:30:52.0369540Z N_H_L=65, 2025-05-07T20:30:52.0369777Z ) 2025-05-07T20:30:52.0370048Z Trying example: test_gqa( 2025-05-07T20:30:52.0370415Z self=, 2025-05-07T20:30:52.0370807Z int4_kv=False, 2025-05-07T20:30:52.0371064Z num_groups=4, 2025-05-07T20:30:52.0371309Z B=65, 2025-05-07T20:30:52.0371530Z MAX_T=65, 2025-05-07T20:30:52.0371767Z N_H_L=65, 2025-05-07T20:30:52.0371997Z ) 2025-05-07T20:30:52.0372224Z Trying example: test_gqa( 2025-05-07T20:30:52.0372577Z self=, 2025-05-07T20:30:52.0372954Z int4_kv=False, 2025-05-07T20:30:52.0373202Z num_groups=1, 2025-05-07T20:30:52.0373450Z B=6, 2025-05-07T20:30:52.0373678Z MAX_T=108, 2025-05-07T20:30:52.0373915Z N_H_L=14, 2025-05-07T20:30:52.0374149Z ) 2025-05-07T20:30:52.0374388Z Trying example: test_gqa( 2025-05-07T20:30:52.0374728Z self=, 2025-05-07T20:30:52.0375106Z int4_kv=False, 2025-05-07T20:30:52.0375364Z num_groups=1, 2025-05-07T20:30:52.0375611Z B=6, 2025-05-07T20:30:52.0375840Z MAX_T=14, 2025-05-07T20:30:52.0376081Z N_H_L=14, 2025-05-07T20:30:52.0376308Z ) 2025-05-07T20:30:52.0376553Z Trying example: test_gqa( 2025-05-07T20:30:52.0376905Z self=, 2025-05-07T20:30:52.0377277Z int4_kv=False, 2025-05-07T20:30:52.0377532Z num_groups=1, 2025-05-07T20:30:52.0377779Z B=6, 2025-05-07T20:30:52.0377997Z MAX_T=6, 2025-05-07T20:30:52.0378230Z N_H_L=14, 2025-05-07T20:30:52.0378462Z ) 2025-05-07T20:30:52.0378690Z Trying example: test_gqa( 2025-05-07T20:30:52.0379040Z self=, 2025-05-07T20:30:52.0379418Z int4_kv=False, 2025-05-07T20:30:52.0379675Z num_groups=1, 2025-05-07T20:30:52.0380052Z B=6, 2025-05-07T20:30:52.0380282Z MAX_T=6, 2025-05-07T20:30:52.0380515Z N_H_L=6, 2025-05-07T20:30:52.0380738Z ) 2025-05-07T20:30:52.0380974Z Trying example: test_gqa( 2025-05-07T20:30:52.0381321Z self=, 2025-05-07T20:30:52.0381703Z int4_kv=False, 2025-05-07T20:30:52.0381965Z num_groups=1, 2025-05-07T20:30:52.0382211Z B=70, 2025-05-07T20:30:52.0382441Z MAX_T=94, 2025-05-07T20:30:52.0382678Z N_H_L=78, 2025-05-07T20:30:52.0382912Z ) 2025-05-07T20:30:52.0383142Z Trying example: test_gqa( 2025-05-07T20:30:52.0383491Z self=, 2025-05-07T20:30:52.0383868Z int4_kv=False, 2025-05-07T20:30:52.0384120Z num_groups=1, 2025-05-07T20:30:52.0384367Z B=78, 2025-05-07T20:30:52.0384594Z MAX_T=94, 2025-05-07T20:30:52.0384825Z N_H_L=78, 2025-05-07T20:30:52.0385058Z ) 2025-05-07T20:30:52.0385291Z Trying example: test_gqa( 2025-05-07T20:30:52.0385632Z self=, 2025-05-07T20:30:52.0386011Z int4_kv=False, 2025-05-07T20:30:52.0386267Z num_groups=1, 2025-05-07T20:30:52.0386510Z B=94, 2025-05-07T20:30:52.0386740Z MAX_T=94, 2025-05-07T20:30:52.0386975Z N_H_L=78, 2025-05-07T20:30:52.0387965Z ) 2025-05-07T20:30:52.0388206Z Trying example: test_gqa( 2025-05-07T20:30:52.0388559Z self=, 2025-05-07T20:30:52.0389031Z int4_kv=False, 2025-05-07T20:30:52.0389289Z num_groups=1, 2025-05-07T20:30:52.0389550Z B=94, 2025-05-07T20:30:52.0389790Z MAX_T=94, 2025-05-07T20:30:52.0390247Z N_H_L=94, 2025-05-07T20:30:52.0390440Z ) 2025-05-07T20:30:52.0390633Z Trying example: test_gqa( 2025-05-07T20:30:52.0390919Z self=, 2025-05-07T20:30:52.0391228Z int4_kv=False, 2025-05-07T20:30:52.0391438Z num_groups=4, 2025-05-07T20:30:52.0391634Z B=41, 2025-05-07T20:30:52.0391822Z MAX_T=105, 
2025-05-07T20:30:52.0392023Z N_H_L=126, 2025-05-07T20:30:52.0392211Z ) 2025-05-07T20:30:52.0392410Z Trying example: test_gqa( 2025-05-07T20:30:52.0392697Z self=, 2025-05-07T20:30:52.0392998Z int4_kv=False, 2025-05-07T20:30:52.0393206Z num_groups=4, 2025-05-07T20:30:52.0393419Z B=105, 2025-05-07T20:30:52.0393600Z MAX_T=105, 2025-05-07T20:30:52.0393801Z N_H_L=126, 2025-05-07T20:30:52.0394000Z ) 2025-05-07T20:30:52.0394187Z Trying example: test_gqa( 2025-05-07T20:30:52.0394476Z self=, 2025-05-07T20:30:52.0394789Z int4_kv=False, 2025-05-07T20:30:52.0394990Z num_groups=4, 2025-05-07T20:30:52.0395192Z B=105, 2025-05-07T20:30:52.0395379Z MAX_T=105, 2025-05-07T20:30:52.0395574Z N_H_L=105, 2025-05-07T20:30:52.0395774Z ) 2025-05-07T20:30:52.0395970Z Trying example: test_gqa( 2025-05-07T20:30:52.0396253Z self=, 2025-05-07T20:30:52.0396562Z int4_kv=True, 2025-05-07T20:30:52.0396767Z num_groups=1, 2025-05-07T20:30:52.0396967Z B=95, 2025-05-07T20:30:52.0397148Z MAX_T=114, 2025-05-07T20:30:52.0397345Z N_H_L=43, 2025-05-07T20:30:52.0397532Z ) 2025-05-07T20:30:52.0397721Z Trying example: test_gqa( 2025-05-07T20:30:52.0398019Z self=, 2025-05-07T20:30:52.0398322Z int4_kv=True, 2025-05-07T20:30:52.0398530Z num_groups=1, 2025-05-07T20:30:52.0398739Z B=43, 2025-05-07T20:30:52.0398929Z MAX_T=114, 2025-05-07T20:30:52.0399123Z N_H_L=43, 2025-05-07T20:30:52.0399316Z ) 2025-05-07T20:30:52.0399513Z Trying example: test_gqa( 2025-05-07T20:30:52.0399802Z self=, 2025-05-07T20:30:52.0400111Z int4_kv=True, 2025-05-07T20:30:52.0400319Z num_groups=1, 2025-05-07T20:30:52.0400514Z B=43, 2025-05-07T20:30:52.0400705Z MAX_T=43, 2025-05-07T20:30:52.0400897Z N_H_L=43, 2025-05-07T20:30:52.0401083Z ) 2025-05-07T20:30:52.0401275Z Trying example: test_gqa( 2025-05-07T20:30:52.0401566Z self=, 2025-05-07T20:30:52.0401868Z int4_kv=False, 2025-05-07T20:30:52.0402076Z num_groups=1, 2025-05-07T20:30:52.0402278Z B=21, 2025-05-07T20:30:52.0402460Z MAX_T=38, 2025-05-07T20:30:52.0402658Z N_H_L=42, 2025-05-07T20:30:52.0402851Z ) 2025-05-07T20:30:52.0403049Z Trying example: test_gqa( 2025-05-07T20:30:52.0403339Z self=, 2025-05-07T20:30:52.0403655Z int4_kv=False, 2025-05-07T20:30:52.0403864Z num_groups=1, 2025-05-07T20:30:52.0404060Z B=38, 2025-05-07T20:30:52.0404246Z MAX_T=38, 2025-05-07T20:30:52.0404441Z N_H_L=42, 2025-05-07T20:30:52.0404622Z ) 2025-05-07T20:30:52.0404816Z Trying example: test_gqa( 2025-05-07T20:30:52.0405105Z self=, 2025-05-07T20:30:52.0405405Z int4_kv=False, 2025-05-07T20:30:52.0405614Z num_groups=1, 2025-05-07T20:30:52.0405816Z B=38, 2025-05-07T20:30:52.0405998Z MAX_T=42, 2025-05-07T20:30:52.0406194Z N_H_L=42, 2025-05-07T20:30:52.0406385Z ) 2025-05-07T20:30:52.0406571Z Trying example: test_gqa( 2025-05-07T20:30:52.0406871Z self=, 2025-05-07T20:30:52.0407186Z int4_kv=False, 2025-05-07T20:30:52.0407599Z num_groups=1, 2025-05-07T20:30:52.0407811Z B=42, 2025-05-07T20:30:52.0408004Z MAX_T=42, 2025-05-07T20:30:52.0408321Z N_H_L=42, 2025-05-07T20:30:52.0408523Z ) 2025-05-07T20:30:52.0408725Z Trying example: test_gqa( 2025-05-07T20:30:52.0409016Z self=, 2025-05-07T20:30:52.0409332Z int4_kv=True, 2025-05-07T20:30:52.0409550Z num_groups=1, 2025-05-07T20:30:52.0409760Z B=74, 2025-05-07T20:30:52.0409947Z MAX_T=20, 2025-05-07T20:30:52.0410147Z N_H_L=15, 2025-05-07T20:30:52.0410344Z ) 2025-05-07T20:30:52.0410539Z Trying example: test_gqa( 2025-05-07T20:30:52.0410837Z self=, 2025-05-07T20:30:52.0411153Z int4_kv=True, 2025-05-07T20:30:52.0411362Z num_groups=1, 2025-05-07T20:30:52.0411570Z B=20, 2025-05-07T20:30:52.0411763Z MAX_T=20, 
2025-05-07T20:30:52.0411955Z N_H_L=15, 2025-05-07T20:30:52.0412152Z ) 2025-05-07T20:30:52.0412348Z Trying example: test_gqa( 2025-05-07T20:30:52.0412646Z self=, 2025-05-07T20:30:52.0412958Z int4_kv=True, 2025-05-07T20:30:52.0413174Z num_groups=1, 2025-05-07T20:30:52.0413379Z B=20, 2025-05-07T20:30:52.0413571Z MAX_T=15, 2025-05-07T20:30:52.0413772Z N_H_L=15, 2025-05-07T20:30:52.0413961Z ) 2025-05-07T20:30:52.0414158Z Trying example: test_gqa( 2025-05-07T20:30:52.0414454Z self=, 2025-05-07T20:30:52.0414761Z int4_kv=True, 2025-05-07T20:30:52.0414974Z num_groups=1, 2025-05-07T20:30:52.0415181Z B=15, 2025-05-07T20:30:52.0415365Z MAX_T=20, 2025-05-07T20:30:52.0415564Z N_H_L=15, 2025-05-07T20:30:52.0415758Z ) 2025-05-07T20:30:52.0415951Z Trying example: test_gqa( 2025-05-07T20:30:52.0416251Z self=, 2025-05-07T20:30:52.0416562Z int4_kv=True, 2025-05-07T20:30:52.0416779Z num_groups=1, 2025-05-07T20:30:52.0416980Z B=15, 2025-05-07T20:30:52.0417180Z MAX_T=15, 2025-05-07T20:30:52.0417382Z N_H_L=15, 2025-05-07T20:30:52.0417572Z ) 2025-05-07T20:30:52.0417769Z Trying example: test_gqa( 2025-05-07T20:30:52.0418073Z self=, 2025-05-07T20:30:52.0418383Z int4_kv=False, 2025-05-07T20:30:52.0418600Z num_groups=4, 2025-05-07T20:30:52.0418811Z B=117, 2025-05-07T20:30:52.0419001Z MAX_T=104, 2025-05-07T20:30:52.0419213Z N_H_L=69, 2025-05-07T20:30:52.0419412Z ) 2025-05-07T20:30:52.0419604Z Trying example: test_gqa( 2025-05-07T20:30:52.0420045Z self=, 2025-05-07T20:30:52.0420400Z int4_kv=False, 2025-05-07T20:30:52.0420609Z num_groups=4, 2025-05-07T20:30:52.0420820Z B=117, 2025-05-07T20:30:52.0421017Z MAX_T=117, 2025-05-07T20:30:52.0421219Z N_H_L=69, 2025-05-07T20:30:52.0421416Z ) 2025-05-07T20:30:52.0421615Z Trying example: test_gqa( 2025-05-07T20:30:52.0421903Z self=, 2025-05-07T20:30:52.0422228Z int4_kv=False, 2025-05-07T20:30:52.0422442Z num_groups=4, 2025-05-07T20:30:52.0422643Z B=69, 2025-05-07T20:30:52.0422837Z MAX_T=117, 2025-05-07T20:30:52.0423045Z N_H_L=69, 2025-05-07T20:30:52.0423237Z ) 2025-05-07T20:30:52.0423434Z Trying example: test_gqa( 2025-05-07T20:30:52.0423725Z self=, 2025-05-07T20:30:52.0424031Z int4_kv=False, 2025-05-07T20:30:52.0424245Z num_groups=4, 2025-05-07T20:30:52.0424453Z B=117, 2025-05-07T20:30:52.0424645Z MAX_T=69, 2025-05-07T20:30:52.0424838Z N_H_L=69, 2025-05-07T20:30:52.0425037Z ) 2025-05-07T20:30:52.0425231Z PASSED 2025-05-07T20:30:52.0695293Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
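(An aside on the verbose "Trying example" listing above: each session header in this log reports a Hypothesis profile named 'ci' with database=None, deadline=None, print_blob=True, derandomize=True, and suppress_health_check=(HealthCheck.too_slow,). A minimal sketch of how such a profile could be registered follows; only the settings values are taken from this log, while the conftest.py location and profile wiring are assumptions.)

    # conftest.py -- hypothetical registration of the 'ci' Hypothesis profile
    # reported in the session headers of this log
    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,       # no example database, as reported in the session header
        deadline=None,       # no per-example deadline
        print_blob=True,     # print reproduction blobs on failure
        derandomize=True,    # deterministic example selection across runs
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")

(The per-example "Trying example" lines themselves come from verbosity=Verbosity.verbose in the tests' own @settings decorators, as shown in the test source printed later in this log.)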
2025-05-07T20:30:52.0695633Z 2025-05-07T20:30:52.0695785Z =========================== short test summary info ============================ 2025-05-07T20:30:52.0696496Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:52.0697505Z ======================== 1 passed, 1 skipped in 38.88s ========================= 2025-05-07T20:30:52.7073836Z 2025-05-07T20:30:52.7074656Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:52.7094563Z [TEST] Python test time for ./attention/gqa_test.py: 41 seconds 2025-05-07T20:30:52.7094866Z 2025-05-07T20:30:52.7095019Z 2025-05-07T20:30:52.7095025Z 2025-05-07T20:30:52.7095064Z 2025-05-07T20:30:52.7123368Z ################################################################################ 2025-05-07T20:30:52.7130959Z # [2025-05-07T20:30:52.712Z] Run Python Test Suite: 2025-05-07T20:30:52.7131308Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:52.7131607Z ################################################################################ 2025-05-07T20:30:52.7157408Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:52.7158316Z 2025-05-07T20:30:54.8561674Z ============================= test session starts ============================== 2025-05-07T20:30:54.8562541Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:54.8563071Z cachedir: .pytest_cache 2025-05-07T20:30:54.8563646Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:54.8564354Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:54.8564771Z plugins: hypothesis-6.131.14 2025-05-07T20:30:56.4267188Z collecting ... 
collected 1 item 2025-05-07T20:30:56.4267539Z 2025-05-07T20:30:57.1638560Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:30:57.1639027Z 2025-05-07T20:30:57.1639228Z ============================== 1 passed in 2.43s =============================== 2025-05-07T20:30:57.7853545Z 2025-05-07T20:30:57.7854255Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:30:57.7873595Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:30:57.7874016Z 2025-05-07T20:30:57.7874023Z 2025-05-07T20:30:57.7874028Z 2025-05-07T20:30:57.7874033Z 2025-05-07T20:30:57.7895531Z ################################################################################ 2025-05-07T20:30:57.7911003Z # [2025-05-07T20:30:57.790Z] Run Python Test Suite: 2025-05-07T20:30:57.7911480Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:30:57.7911881Z ################################################################################ 2025-05-07T20:30:57.7935479Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:30:57.7936212Z 2025-05-07T20:30:59.9290802Z ============================= test session starts ============================== 2025-05-07T20:30:59.9291625Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:59.9292165Z cachedir: .pytest_cache 2025-05-07T20:30:59.9292738Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:59.9293457Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:59.9293870Z plugins: hypothesis-6.131.14 2025-05-07T20:31:01.5131479Z collecting ... 
collected 5 items 2025-05-07T20:31:01.5131770Z 2025-05-07T20:31:01.5142365Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:01.5150569Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:01.5158067Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:01.5165464Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:01.5181343Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:01.5181676Z 2025-05-07T20:31:01.5182031Z =========================== short test summary info ============================ 2025-05-07T20:31:01.5182701Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5183633Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5184541Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5185449Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5186353Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.5186997Z ============================== 5 skipped in 1.71s ============================== 2025-05-07T20:31:02.0714264Z 2025-05-07T20:31:02.0718874Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:02.0735948Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:02.0736241Z 2025-05-07T20:31:02.0736245Z 2025-05-07T20:31:02.0736261Z 2025-05-07T20:31:02.0736265Z 2025-05-07T20:31:02.0757797Z ################################################################################ 2025-05-07T20:31:02.0773944Z # [2025-05-07T20:31:02.077Z] Run Python Test Suite: 2025-05-07T20:31:02.0774288Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:02.0774607Z ################################################################################ 2025-05-07T20:31:02.0799181Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:02.0799977Z 2025-05-07T20:31:04.2262517Z ============================= test session starts ============================== 2025-05-07T20:31:04.2263183Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:04.2263715Z cachedir: .pytest_cache 2025-05-07T20:31:04.2264308Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:04.2265037Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:04.2265442Z plugins: hypothesis-6.131.14 2025-05-07T20:31:05.8815250Z collecting ... 
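(All five LLamaMultiGpuTests cases above are skipped with the same message. A sketch of the guard implied by that message follows; the decorator placement is an assumption, while the condition and skip text mirror the log.)

    # hypothetical guard consistent with the multi-GPU skip reason in this log
    import unittest

    import torch

    @unittest.skipIf(
        not torch.cuda.is_available() or torch.cuda.device_count() < 2,
        "Skip when CUDA is not available or when there are not enough GPUs; "
        "these tests require at least two GPUs",
    )
    class LLamaMultiGpuTests(unittest.TestCase):
        ...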
collected 2 items 2025-05-07T20:31:05.8815464Z 2025-05-07T20:31:05.8827233Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:05.8841796Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:05.8842219Z 2025-05-07T20:31:05.8842392Z =========================== short test summary info ============================ 2025-05-07T20:31:05.8843026Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:05.8843854Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:05.8844443Z ============================== 2 skipped in 1.78s ============================== 2025-05-07T20:31:06.4460044Z 2025-05-07T20:31:06.4461018Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:06.4480670Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:31:06.4480996Z 2025-05-07T20:31:06.4481000Z 2025-05-07T20:31:06.4481004Z 2025-05-07T20:31:06.4481405Z 2025-05-07T20:31:06.4505280Z ################################################################################ 2025-05-07T20:31:06.4521023Z # [2025-05-07T20:31:06.451Z] Run Python Test Suite: 2025-05-07T20:31:06.4521366Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:06.4521648Z ################################################################################ 2025-05-07T20:31:06.4546115Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:06.4546730Z 2025-05-07T20:31:08.5923822Z ============================= test session starts ============================== 2025-05-07T20:31:08.5924461Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:08.5925097Z cachedir: .pytest_cache 2025-05-07T20:31:08.5926275Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:08.5927722Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:08.5928526Z plugins: hypothesis-6.131.14 2025-05-07T20:31:10.1592638Z collecting ... collected 4 items 2025-05-07T20:31:10.1592852Z 2025-05-07T20:31:12.9870380Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
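(The gather_scatter cases above are skipped for lack of a Hopper GPU. A sketch of such a capability check follows; the helper name is hypothetical, and it relies on Hopper-class parts such as the H100 reporting CUDA compute capability 9.x.)

    # hypothetical Hopper detection matching the gather_scatter skip reason
    import torch

    def has_hopper_gpu() -> bool:
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability()
        return major == 9  # Hopper (sm_90) reports compute capability 9.x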
2025-05-07T20:31:13.0001564Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:13.0156042Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:13.0287462Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:13.0287818Z 2025-05-07T20:31:13.0287979Z =========================== short test summary info ============================ 2025-05-07T20:31:13.0288679Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:13.0289625Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when xformers is not available 2025-05-07T20:31:13.0290464Z ============================== 4 skipped in 4.56s ============================== 2025-05-07T20:31:14.9077500Z 2025-05-07T20:31:14.9078509Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:14.9097909Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:14.9098302Z 2025-05-07T20:31:14.9098313Z 2025-05-07T20:31:14.9098317Z 2025-05-07T20:31:14.9098323Z 2025-05-07T20:31:14.9119763Z ################################################################################ 2025-05-07T20:31:14.9135400Z # [2025-05-07T20:31:14.913Z] Run Python Test Suite: 2025-05-07T20:31:14.9135841Z # ./moe/activation_test.py 2025-05-07T20:31:14.9136213Z ################################################################################ 2025-05-07T20:31:14.9160556Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:14.9161170Z 2025-05-07T20:31:17.0756115Z ============================= test session starts ============================== 2025-05-07T20:31:17.0756758Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:17.0757289Z cachedir: .pytest_cache 2025-05-07T20:31:17.0758055Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:17.0759152Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:17.0759768Z plugins: hypothesis-6.131.14 2025-05-07T20:31:18.7121040Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:18.8897293Z collecting ... 
collected 2 items 2025-05-07T20:31:18.8897926Z 2025-05-07T20:31:24.3174316Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:24.3175382Z self=, 2025-05-07T20:31:24.3175780Z T=1, 2025-05-07T20:31:24.3175973Z D=5120, 2025-05-07T20:31:24.3176226Z contiguous=True, 2025-05-07T20:31:24.3176557Z compiled=True, 2025-05-07T20:31:24.3176833Z ) 2025-05-07T20:31:24.3177029Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3177477Z self=, 2025-05-07T20:31:24.3178020Z T=4096, 2025-05-07T20:31:24.3178295Z D=5120, 2025-05-07T20:31:24.3178584Z contiguous=True, 2025-05-07T20:31:24.3178893Z compiled=True, 2025-05-07T20:31:24.3179169Z ) 2025-05-07T20:31:24.3179436Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3180082Z self=, 2025-05-07T20:31:24.3180615Z T=4096, 2025-05-07T20:31:24.3180829Z D=7168, 2025-05-07T20:31:24.3181035Z contiguous=False, 2025-05-07T20:31:24.3181263Z compiled=False, 2025-05-07T20:31:24.3181472Z ) 2025-05-07T20:31:24.3181682Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3182059Z self=, 2025-05-07T20:31:24.3182437Z T=4096, 2025-05-07T20:31:24.3182629Z D=5120, 2025-05-07T20:31:24.3182827Z contiguous=False, 2025-05-07T20:31:24.3183053Z compiled=True, 2025-05-07T20:31:24.3183258Z ) 2025-05-07T20:31:24.3183456Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3183820Z self=, 2025-05-07T20:31:24.3184252Z T=1, 2025-05-07T20:31:24.3184528Z D=7168, 2025-05-07T20:31:24.3184801Z contiguous=True, 2025-05-07T20:31:24.3185110Z compiled=True, 2025-05-07T20:31:24.3185394Z ) 2025-05-07T20:31:24.3185621Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3185993Z self=, 2025-05-07T20:31:24.3186374Z T=1, 2025-05-07T20:31:24.3186550Z D=7168, 2025-05-07T20:31:24.3186753Z contiguous=False, 2025-05-07T20:31:24.3186984Z compiled=True, 2025-05-07T20:31:24.3187185Z ) 2025-05-07T20:31:24.3187388Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3187760Z self=, 2025-05-07T20:31:24.3188143Z T=4096, 2025-05-07T20:31:24.3188323Z D=5120, 2025-05-07T20:31:24.3188525Z contiguous=False, 2025-05-07T20:31:24.3188752Z compiled=False, 2025-05-07T20:31:24.3188951Z ) 2025-05-07T20:31:24.3189153Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3189528Z self=, 2025-05-07T20:31:24.3190329Z T=1, 2025-05-07T20:31:24.3190573Z D=7168, 2025-05-07T20:31:24.3190776Z contiguous=True, 2025-05-07T20:31:24.3190995Z compiled=False, 2025-05-07T20:31:24.3191204Z ) 2025-05-07T20:31:24.3191420Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3191789Z self=, 2025-05-07T20:31:24.3192175Z T=2048, 2025-05-07T20:31:24.3192370Z D=5120, 2025-05-07T20:31:24.3192561Z contiguous=True, 2025-05-07T20:31:24.3192787Z compiled=True, 2025-05-07T20:31:24.3192995Z ) 2025-05-07T20:31:24.3193193Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3193566Z self=, 2025-05-07T20:31:24.3193943Z T=2048, 2025-05-07T20:31:24.3194136Z D=7168, 2025-05-07T20:31:24.3194327Z contiguous=True, 2025-05-07T20:31:24.3194551Z compiled=True, 2025-05-07T20:31:24.3194754Z ) 2025-05-07T20:31:24.3194952Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3195323Z self=, 2025-05-07T20:31:24.3195694Z T=2048, 2025-05-07T20:31:24.3195882Z D=7168, 2025-05-07T20:31:24.3196083Z contiguous=True, 2025-05-07T20:31:24.3196546Z compiled=False, 2025-05-07T20:31:24.3196742Z ) 2025-05-07T20:31:24.3196941Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3197446Z self=, 2025-05-07T20:31:24.3197822Z T=128, 2025-05-07T20:31:24.3198006Z D=5120, 2025-05-07T20:31:24.3198205Z contiguous=False, 2025-05-07T20:31:24.3198428Z 
compiled=True, 2025-05-07T20:31:24.3198632Z ) 2025-05-07T20:31:24.3198835Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3199199Z self=, 2025-05-07T20:31:24.3199574Z T=128, 2025-05-07T20:31:24.3199760Z D=5120, 2025-05-07T20:31:24.3199953Z contiguous=True, 2025-05-07T20:31:24.3200180Z compiled=True, 2025-05-07T20:31:24.3200386Z ) 2025-05-07T20:31:24.3200582Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3200955Z self=, 2025-05-07T20:31:24.3201330Z T=16384, 2025-05-07T20:31:24.3201530Z D=5120, 2025-05-07T20:31:24.3201725Z contiguous=False, 2025-05-07T20:31:24.3201951Z compiled=True, 2025-05-07T20:31:24.3202160Z ) 2025-05-07T20:31:24.3202358Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3202732Z self=, 2025-05-07T20:31:24.3203108Z T=16384, 2025-05-07T20:31:24.3203298Z D=5120, 2025-05-07T20:31:24.3203496Z contiguous=False, 2025-05-07T20:31:24.3203725Z compiled=False, 2025-05-07T20:31:24.3203924Z ) 2025-05-07T20:31:24.3204133Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3204505Z self=, 2025-05-07T20:31:24.3204876Z T=128, 2025-05-07T20:31:24.3205068Z D=7168, 2025-05-07T20:31:24.3205268Z contiguous=True, 2025-05-07T20:31:24.3205487Z compiled=False, 2025-05-07T20:31:24.3205695Z ) 2025-05-07T20:31:24.3205897Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3206267Z self=, 2025-05-07T20:31:24.3206649Z T=128, 2025-05-07T20:31:24.3206839Z D=7168, 2025-05-07T20:31:24.3207047Z contiguous=False, 2025-05-07T20:31:24.3207273Z compiled=False, 2025-05-07T20:31:24.3207481Z ) 2025-05-07T20:31:24.3207684Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3208053Z self=, 2025-05-07T20:31:24.3208428Z T=1, 2025-05-07T20:31:24.3208616Z D=5120, 2025-05-07T20:31:24.3208810Z contiguous=False, 2025-05-07T20:31:24.3209039Z compiled=False, 2025-05-07T20:31:24.3209242Z ) 2025-05-07T20:31:24.3209442Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3209850Z self=, 2025-05-07T20:31:24.3210235Z T=1, 2025-05-07T20:31:24.3210425Z D=7168, 2025-05-07T20:31:24.3210625Z contiguous=False, 2025-05-07T20:31:24.3210848Z compiled=False, 2025-05-07T20:31:24.3211057Z ) 2025-05-07T20:31:24.3211265Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3211636Z self=, 2025-05-07T20:31:24.3212010Z T=4096, 2025-05-07T20:31:24.3212205Z D=5120, 2025-05-07T20:31:24.3212409Z contiguous=True, 2025-05-07T20:31:24.3212630Z compiled=False, 2025-05-07T20:31:24.3212839Z ) 2025-05-07T20:31:24.3213039Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3213405Z self=, 2025-05-07T20:31:24.3213787Z T=128, 2025-05-07T20:31:24.3213972Z D=7168, 2025-05-07T20:31:24.3214164Z contiguous=True, 2025-05-07T20:31:24.3214387Z compiled=True, 2025-05-07T20:31:24.3214592Z ) 2025-05-07T20:31:24.3214789Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3215158Z self=, 2025-05-07T20:31:24.3215538Z T=1, 2025-05-07T20:31:24.3215714Z D=5120, 2025-05-07T20:31:24.3215915Z contiguous=False, 2025-05-07T20:31:24.3216239Z compiled=True, 2025-05-07T20:31:24.3216435Z ) 2025-05-07T20:31:24.3216637Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3217098Z self=, 2025-05-07T20:31:24.3217473Z T=4096, 2025-05-07T20:31:24.3217669Z D=7168, 2025-05-07T20:31:24.3217872Z contiguous=True, 2025-05-07T20:31:24.3218099Z compiled=False, 2025-05-07T20:31:24.3218298Z ) 2025-05-07T20:31:24.3218504Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3218876Z self=, 2025-05-07T20:31:24.3219245Z T=4096, 2025-05-07T20:31:24.3219438Z D=7168, 2025-05-07T20:31:24.3219642Z contiguous=False, 2025-05-07T20:31:24.3219992Z compiled=True, 2025-05-07T20:31:24.3220267Z ) 
2025-05-07T20:31:24.3220543Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3221050Z self=, 2025-05-07T20:31:24.3221569Z T=128, 2025-05-07T20:31:24.3221831Z D=5120, 2025-05-07T20:31:24.3222087Z contiguous=True, 2025-05-07T20:31:24.3222602Z compiled=False, 2025-05-07T20:31:24.3222896Z ) 2025-05-07T20:31:24.3223124Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3223496Z self=, 2025-05-07T20:31:24.3223869Z T=128, 2025-05-07T20:31:24.3224049Z D=5120, 2025-05-07T20:31:24.3224251Z contiguous=False, 2025-05-07T20:31:24.3224478Z compiled=False, 2025-05-07T20:31:24.3224678Z ) 2025-05-07T20:31:24.3224882Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3225258Z self=, 2025-05-07T20:31:24.3225630Z T=1, 2025-05-07T20:31:24.3225806Z D=5120, 2025-05-07T20:31:24.3226008Z contiguous=True, 2025-05-07T20:31:24.3226234Z compiled=False, 2025-05-07T20:31:24.3226434Z ) 2025-05-07T20:31:24.3226636Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3227009Z self=, 2025-05-07T20:31:24.3227380Z T=2048, 2025-05-07T20:31:24.3227573Z D=7168, 2025-05-07T20:31:24.3227781Z contiguous=False, 2025-05-07T20:31:24.3228005Z compiled=True, 2025-05-07T20:31:24.3228224Z ) 2025-05-07T20:31:24.3228427Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3228799Z self=, 2025-05-07T20:31:24.3229172Z T=2048, 2025-05-07T20:31:24.3229367Z D=7168, 2025-05-07T20:31:24.3229561Z contiguous=False, 2025-05-07T20:31:24.3229789Z compiled=False, 2025-05-07T20:31:24.3230008Z ) 2025-05-07T20:31:24.3230203Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3230578Z self=, 2025-05-07T20:31:24.3230961Z T=16384, 2025-05-07T20:31:24.3231158Z D=7168, 2025-05-07T20:31:24.3231359Z contiguous=False, 2025-05-07T20:31:24.3231594Z compiled=True, 2025-05-07T20:31:24.3231804Z ) 2025-05-07T20:31:24.3232009Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3232385Z self=, 2025-05-07T20:31:24.3232764Z T=16384, 2025-05-07T20:31:24.3232957Z D=7168, 2025-05-07T20:31:24.3233305Z contiguous=True, 2025-05-07T20:31:24.3233536Z compiled=True, 2025-05-07T20:31:24.3233737Z ) 2025-05-07T20:31:24.3233939Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3234314Z self=, 2025-05-07T20:31:24.3234683Z T=4096, 2025-05-07T20:31:24.3234873Z D=7168, 2025-05-07T20:31:24.3235072Z contiguous=True, 2025-05-07T20:31:24.3235291Z compiled=True, 2025-05-07T20:31:24.3235496Z ) 2025-05-07T20:31:24.3235699Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3236063Z self=, 2025-05-07T20:31:24.3236436Z T=2048, 2025-05-07T20:31:24.3236626Z D=5120, 2025-05-07T20:31:24.3236821Z contiguous=False, 2025-05-07T20:31:24.3237148Z compiled=False, 2025-05-07T20:31:24.3237353Z ) 2025-05-07T20:31:24.3237550Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3238010Z self=, 2025-05-07T20:31:24.3238394Z T=2048, 2025-05-07T20:31:24.3238581Z D=5120, 2025-05-07T20:31:24.3238771Z contiguous=True, 2025-05-07T20:31:24.3238997Z compiled=False, 2025-05-07T20:31:24.3239204Z ) 2025-05-07T20:31:24.3239401Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3239772Z self=, 2025-05-07T20:31:24.3240145Z T=128, 2025-05-07T20:31:24.3240327Z D=7168, 2025-05-07T20:31:24.3240527Z contiguous=False, 2025-05-07T20:31:24.3240752Z compiled=True, 2025-05-07T20:31:24.3240948Z ) 2025-05-07T20:31:24.3241155Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3241544Z self=, 2025-05-07T20:31:24.3241920Z T=16384, 2025-05-07T20:31:24.3249610Z D=5120, 2025-05-07T20:31:24.3249888Z contiguous=True, 2025-05-07T20:31:24.3250157Z compiled=True, 2025-05-07T20:31:24.3250387Z ) 2025-05-07T20:31:24.3250599Z Trying example: 
test_silu_mul( 2025-05-07T20:31:24.3250990Z self=, 2025-05-07T20:31:24.3251382Z T=2048, 2025-05-07T20:31:24.3251584Z D=5120, 2025-05-07T20:31:24.3251788Z contiguous=False, 2025-05-07T20:31:24.3252026Z compiled=True, 2025-05-07T20:31:24.3252240Z ) 2025-05-07T20:31:24.3252441Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3252820Z self=, 2025-05-07T20:31:24.3253202Z T=16384, 2025-05-07T20:31:24.3253401Z D=5120, 2025-05-07T20:31:24.3253606Z contiguous=True, 2025-05-07T20:31:24.3253840Z compiled=False, 2025-05-07T20:31:24.3254052Z ) 2025-05-07T20:31:24.3254262Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3254652Z self=, 2025-05-07T20:31:24.3255028Z T=16384, 2025-05-07T20:31:24.3255229Z D=7168, 2025-05-07T20:31:24.3255439Z contiguous=False, 2025-05-07T20:31:24.3255672Z compiled=False, 2025-05-07T20:31:24.3255894Z ) 2025-05-07T20:31:24.3256101Z Trying example: test_silu_mul( 2025-05-07T20:31:24.3256473Z self=, 2025-05-07T20:31:24.3256859Z T=16384, 2025-05-07T20:31:24.3257059Z D=7168, 2025-05-07T20:31:24.3257258Z contiguous=True, 2025-05-07T20:31:24.3257488Z compiled=False, 2025-05-07T20:31:24.3257703Z ) 2025-05-07T20:31:24.3257905Z PASSED 2025-05-07T20:31:24.3842346Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.3843447Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.3844824Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.3846247Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.3847605Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.3848968Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.3850746Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.3852111Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.3853529Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.3854760Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.3855969Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.3857167Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.3858191Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:24.3859197Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.3860746Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.3862562Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.3863670Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:24.3864713Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.3865878Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.3867217Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.3868258Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.3869169Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.3869911Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.3870922Z W0507 20:31:24.382000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.4009372Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.4010430Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.4012893Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.4014316Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.4015678Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.4017043Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.4018335Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.4019707Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.4021911Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.4023209Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.4024414Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.4025627Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.4026652Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:24.4027656Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.4028854Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.4030112Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.4031219Z W0507 
20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:24.4032245Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.4033404Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.4034739Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.4035774Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.4036874Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.4037617Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.4038622Z W0507 20:31:24.400000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.4420031Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.4421117Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.4422458Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.4423889Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.4425252Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.4426631Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.4427914Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.4429288Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.4430696Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.4431938Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.4433145Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.4434344Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.4435372Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:24.4436372Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.4437579Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.4438840Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.4440430Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:24.4441482Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.4442647Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.4443997Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.4445040Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.4445941Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.4446676Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.4447681Z W0507 20:31:24.441000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.4462782Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.4463839Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.4465168Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.4466589Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.4467947Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.4469321Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.4470607Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.4471976Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.4473378Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.4474610Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.4475821Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.4477169Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.4478265Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:24.4479283Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.4480488Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.4481762Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.4482868Z W0507 
20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:24.4483906Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.4485066Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.4486403Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.4487452Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.4488349Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.4489089Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.4490386Z W0507 20:31:24.445000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.8892400Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:24.8893208Z self=, 2025-05-07T20:31:24.8893634Z T=1, 2025-05-07T20:31:24.8893833Z D=5120, 2025-05-07T20:31:24.8894027Z scale_ub=None, 2025-05-07T20:31:24.8894251Z contiguous=True, 2025-05-07T20:31:24.8894484Z compiled=True, 2025-05-07T20:31:24.8894700Z ) 2025-05-07T20:31:24.8895034Z self = 2025-05-07T20:31:24.8895566Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:24.8895848Z 2025-05-07T20:31:24.8895933Z @given( 2025-05-07T20:31:24.8896172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:24.8896493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:24.8896809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:24.8897150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:24.8897491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:24.8897782Z ) 2025-05-07T20:31:24.8898138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:24.8898598Z def test_silu_mul_quant( 2025-05-07T20:31:24.8898848Z self, 2025-05-07T20:31:24.8899050Z T: int, 2025-05-07T20:31:24.8899250Z D: int, 2025-05-07T20:31:24.8899477Z scale_ub: Optional[float], 2025-05-07T20:31:24.8900316Z contiguous: bool, 2025-05-07T20:31:24.8900558Z compiled: bool, 2025-05-07T20:31:24.8900789Z ) -> None: 2025-05-07T20:31:24.8901155Z torch.manual_seed(2025) 2025-05-07T20:31:24.8901404Z 2025-05-07T20:31:24.8901694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:24.8902046Z 2025-05-07T20:31:24.8902239Z x_sign = torch.sign(x) 2025-05-07T20:31:24.8902541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:24.8902860Z x = x_sign * x_clamp 2025-05-07T20:31:24.8903104Z x0 = x[:, :D] 2025-05-07T20:31:24.8903331Z x1 = x[:, D:] 2025-05-07T20:31:24.8903544Z 2025-05-07T20:31:24.8903729Z if contiguous: 2025-05-07T20:31:24.8903971Z x0 = x0.contiguous() 
2025-05-07T20:31:24.8904239Z x1 = x1.contiguous() 2025-05-07T20:31:24.8904479Z 2025-05-07T20:31:24.8904678Z if scale_ub is not None: 2025-05-07T20:31:24.8904961Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:24.8905319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:24.8905631Z ) 2025-05-07T20:31:24.8905837Z else: 2025-05-07T20:31:24.8906059Z scale_ub_tensor = None 2025-05-07T20:31:24.8906317Z 2025-05-07T20:31:24.8906561Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:24.8906882Z op = silu_mul_quant 2025-05-07T20:31:24.8907137Z if compiled: 2025-05-07T20:31:24.8907397Z op = torch.compile(op) 2025-05-07T20:31:24.8907703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:24.8907978Z 2025-05-07T20:31:24.8908179Z y_fp8, y_scale = fn() 2025-05-07T20:31:24.8908481Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:24.8908772Z 2025-05-07T20:31:24.8909020Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:24.8909365Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:24.8909670Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:24.8909989Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:24.8910362Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:24.8910678Z 2025-05-07T20:31:24.8910879Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:24.8911084Z 2025-05-07T20:31:24.8911190Z moe/activation_test.py:126: 2025-05-07T20:31:24.8911498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:24.8911837Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:24.8912174Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:24.8912989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:24.8913765Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:24.8914323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:24.8915035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:24.8915751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:24.8916498Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:24.8917268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:24.8918040Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:24.8918793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:24.8919445Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:24.8920107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:24.8920745Z fn() 2025-05-07T20:31:24.8921341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:24.8921932Z self.fn.run( 2025-05-07T20:31:24.8922415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:24.8922964Z kernel = self.compile( 2025-05-07T20:31:24.8923512Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:24.8924185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.8924593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:24.8924828Z 2025-05-07T20:31:24.8925047Z self = 2025-05-07T20:31:24.8926159Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:24.8927717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7e568550>} 2025-05-07T20:31:24.8929238Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:24.8930364Z context = 2025-05-07T20:31:24.8930668Z 2025-05-07T20:31:24.8930841Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:24.8931384Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.8931874Z module_map=module_map) 2025-05-07T20:31:24.8932261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.8932634Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:24.8932900Z E ^ 2025-05-07T20:31:24.8933381Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:24.8933850Z 2025-05-07T20:31:24.8934282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:24.8934812Z 2025-05-07T20:31:24.8934926Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:24.8935351Z self=, 2025-05-07T20:31:24.8935765Z T=2048, 2025-05-07T20:31:24.8935966Z D=5120, 2025-05-07T20:31:24.8936159Z scale_ub=1200.0, 2025-05-07T20:31:24.8936390Z contiguous=True, 2025-05-07T20:31:24.8936624Z compiled=False, 2025-05-07T20:31:24.8936829Z ) 2025-05-07T20:31:25.4284606Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:25.4286225Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:25.4288236Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:25.4290567Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:25.4292279Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:25.4293807Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:25.4295108Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:25.4296478Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:25.4297874Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:25.4299126Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:25.4300432Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:25.4301642Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:25.4302673Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:25.4303684Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:25.4304897Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:25.4306152Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:25.4307253Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:25.4308281Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:25.4309446Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:25.4310797Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:25.4311845Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:25.4312746Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:25.4313478Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:25.4314483Z W0507 20:31:25.424000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.8330146Z self = 2025-05-07T20:31:26.8330802Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:26.8331230Z 2025-05-07T20:31:26.8331340Z @given( 2025-05-07T20:31:26.8331675Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:26.8332016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:26.8332322Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:26.8332676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:26.8332999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:26.8333286Z ) 2025-05-07T20:31:26.8341191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:26.8341684Z def test_silu_mul_quant( 2025-05-07T20:31:26.8341932Z self, 2025-05-07T20:31:26.8342139Z T: int, 2025-05-07T20:31:26.8342521Z D: int, 2025-05-07T20:31:26.8342784Z scale_ub: Optional[float], 2025-05-07T20:31:26.8343068Z contiguous: bool, 2025-05-07T20:31:26.8343310Z compiled: bool, 2025-05-07T20:31:26.8343916Z ) -> None: 2025-05-07T20:31:26.8344145Z torch.manual_seed(2025) 2025-05-07T20:31:26.8344392Z 2025-05-07T20:31:26.8344811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:26.8345169Z 2025-05-07T20:31:26.8345365Z x_sign = torch.sign(x) 2025-05-07T20:31:26.8345666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:26.8345986Z x = x_sign * x_clamp 2025-05-07T20:31:26.8346230Z x0 = x[:, :D] 2025-05-07T20:31:26.8346455Z x1 = x[:, D:] 2025-05-07T20:31:26.8346672Z 2025-05-07T20:31:26.8346860Z if contiguous: 2025-05-07T20:31:26.8347100Z x0 = x0.contiguous() 2025-05-07T20:31:26.8347365Z x1 = x1.contiguous() 2025-05-07T20:31:26.8347604Z 2025-05-07T20:31:26.8347809Z if scale_ub is not None: 2025-05-07T20:31:26.8348085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:26.8348432Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:26.8348748Z ) 2025-05-07T20:31:26.8348950Z else: 2025-05-07T20:31:26.8349169Z scale_ub_tensor = None
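The repeated ValueError above comes from Triton rejecting fp8e4nv, the type that torch.float8_e4m3fn lowers to: it requires compute capability 8.9 or newer (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner is SM 8.6 and only exposes fp8e4b15 and fp8e5. A minimal sketch of a capability check, with a hypothetical helper name (not part of the test shown here):

    import torch

    # Hypothetical guard: Triton fp8e4nv (torch.float8_e4m3fn) kernels
    # need SM 8.9+; the A10G driving this job is SM 8.6, so compilation
    # fails exactly as logged above.
    def device_supports_fp8e4nv() -> bool:
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )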
2025-05-07T20:31:26.8349427Z 2025-05-07T20:31:26.8349666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:26.8349986Z op = silu_mul_quant 2025-05-07T20:31:26.8350240Z if compiled: 2025-05-07T20:31:26.8350495Z op = torch.compile(op) 2025-05-07T20:31:26.8350799Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:26.8351072Z 2025-05-07T20:31:26.8351277Z > y_fp8, y_scale = fn() 2025-05-07T20:31:26.8351442Z 2025-05-07T20:31:26.8351558Z moe/activation_test.py:117: 2025-05-07T20:31:26.8351859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8352196Z moe/activation_test.py:115: in fn 2025-05-07T20:31:26.8352560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:26.8353332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:26.8354043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:26.8354584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:26.8355270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:26.8355942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:26.8356470Z kernel = self.compile( 2025-05-07T20:31:26.8357019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:26.8357685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.8358081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8358318Z 2025-05-07T20:31:26.8358534Z self = 2025-05-07T20:31:26.8359619Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:26.8361001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7f67b250>} 2025-05-07T20:31:26.8362340Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:26.8363370Z context = 2025-05-07T20:31:26.8363666Z 2025-05-07T20:31:26.8363834Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:26.8364465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.8365016Z module_map=module_map) 2025-05-07T20:31:26.8365388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.8365757Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.8366026Z E ^ 2025-05-07T20:31:26.8366489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.8366945Z 2025-05-07T20:31:26.8367366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:26.8367884Z 2025-05-07T20:31:26.8367996Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:26.8368413Z self=, 2025-05-07T20:31:26.8368819Z T=2048, 2025-05-07T20:31:26.8369018Z D=5120, 2025-05-07T20:31:26.8369227Z scale_ub=1200.0, 2025-05-07T20:31:26.8369453Z contiguous=True, 2025-05-07T20:31:26.8369688Z compiled=True, 2025-05-07T20:31:26.8369901Z ) 2025-05-07T20:31:26.8370222Z self = 2025-05-07T20:31:26.8370717Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:26.8385924Z >
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:26.8386124Z 2025-05-07T20:31:26.8386237Z moe/activation_test.py:126: 2025-05-07T20:31:26.8386532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8386868Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:26.8387197Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:26.8387985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:26.8388741Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:26.8389294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:26.8390336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:26.8391023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:26.8391740Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:26.8392489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:26.8393227Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:26.8393943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:26.8394590Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:26.8395188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:26.8395704Z fn() 2025-05-07T20:31:26.8396206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:26.8396782Z self.fn.run( 2025-05-07T20:31:26.8397257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:26.8397780Z kernel = self.compile( 2025-05-07T20:31:26.8398316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:26.8398973Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.8399367Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8399598Z 2025-05-07T20:31:26.8399812Z self = 2025-05-07T20:31:26.8400884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:26.8402263Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1c7f67a950>} 2025-05-07T20:31:26.8403593Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:26.8404625Z context = 2025-05-07T20:31:26.8405072Z 2025-05-07T20:31:26.8405241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:26.8405896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.8406365Z module_map=module_map) 2025-05-07T20:31:26.8406723Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.8407075Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:26.8407338Z E ^ 2025-05-07T20:31:26.8407796Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.8408238Z 2025-05-07T20:31:26.8408649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:26.8409163Z 2025-05-07T20:31:26.8409266Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:26.8409674Z self=, 2025-05-07T20:31:26.8410071Z T=16384, 2025-05-07T20:31:26.8410266Z D=7168, 2025-05-07T20:31:26.8410460Z scale_ub=1200.0, 2025-05-07T20:31:26.8410689Z contiguous=False, 2025-05-07T20:31:26.8410913Z compiled=False, 2025-05-07T20:31:26.8411120Z ) 2025-05-07T20:31:27.2118732Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:27.2119980Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:27.2121369Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:27.2122803Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:27.2124220Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:27.2125603Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:27.2126902Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:27.2128267Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:27.2129685Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:27.2130917Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 
2025-05-07T20:31:27.2132137Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:27.2133343Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:27.2134380Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:27.2135864Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:27.2137078Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:27.2138355Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:27.2139470Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:27.2140607Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:27.2141798Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:27.2143148Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:27.2144208Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:27.2145118Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:27.2145858Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:27.2146879Z W0507 20:31:27.208000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.1462142Z self = 2025-05-07T20:31:29.1463083Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:29.1463563Z 2025-05-07T20:31:29.1463692Z @given( 2025-05-07T20:31:29.1464061Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:29.1464566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:29.1465067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:29.1465591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:29.1466053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:29.1466473Z ) 2025-05-07T20:31:29.1466997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:29.1467677Z def test_silu_mul_quant( 2025-05-07T20:31:29.1468047Z self, 2025-05-07T20:31:29.1468352Z T: int, 2025-05-07T20:31:29.1468657Z D: int, 2025-05-07T20:31:29.1469008Z scale_ub: Optional[float], 2025-05-07T20:31:29.1469484Z contiguous: bool, 2025-05-07T20:31:29.1469868Z compiled: bool, 2025-05-07T20:31:29.1470238Z ) -> None: 2025-05-07T20:31:29.1470597Z torch.manual_seed(2025) 2025-05-07T20:31:29.1471010Z 2025-05-07T20:31:29.1471463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:29.1472041Z 2025-05-07T20:31:29.1472346Z x_sign = torch.sign(x) 2025-05-07T20:31:29.1472808Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:29.1473337Z x = x_sign * x_clamp 2025-05-07T20:31:29.1473744Z x0 = x[:, :D] 2025-05-07T20:31:29.1474073Z x1 = x[:, D:] 2025-05-07T20:31:29.1474401Z 2025-05-07T20:31:29.1474695Z if contiguous: 2025-05-07T20:31:29.1475065Z x0 = x0.contiguous() 2025-05-07T20:31:29.1475495Z x1 = x1.contiguous() 2025-05-07T20:31:29.1475900Z 2025-05-07T20:31:29.1476210Z if scale_ub is not None: 2025-05-07T20:31:29.1476654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:29.1477213Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:29.1477732Z ) 2025-05-07T20:31:29.1478041Z else: 2025-05-07T20:31:29.1478393Z scale_ub_tensor = None
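The ref_fn in this test computes SiLU(x0) * x1 in fp32 and then rowwise FP8 quantization via triton_quantize_fp8_row. A minimal eager sketch of that contract, assuming standard rowwise semantics (448.0 is the torch.float8_e4m3fn maximum) and a hypothetical helper name; the real triton_quantize_fp8_row is a Triton kernel implementation of the same idea:

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1, computed in fp32 as in ref_fn above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Rowwise scale: map each row's absmax (optionally capped at
        # scale_ub) onto the float8_e4m3fn maximum of 448.0.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / 448.0
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantization mirrors the test:
        # y is approximately y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale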
2025-05-07T20:31:29.1478825Z 2025-05-07T20:31:29.1479201Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.1479734Z op = silu_mul_quant 2025-05-07T20:31:29.1480146Z if compiled: 2025-05-07T20:31:29.1480545Z op = torch.compile(op) 2025-05-07T20:31:29.1481039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.1481500Z 2025-05-07T20:31:29.1481807Z > y_fp8, y_scale = fn() 2025-05-07T20:31:29.1482098Z 2025-05-07T20:31:29.1482262Z moe/activation_test.py:117: 2025-05-07T20:31:29.1482766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.1483796Z moe/activation_test.py:115: in fn 2025-05-07T20:31:29.1484458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.1485695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:29.1486929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:29.1487854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:29.1489029Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:29.1490519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:29.1491466Z kernel = self.compile( 2025-05-07T20:31:29.1492409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:29.1493583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.1494287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.1494686Z 2025-05-07T20:31:29.1495028Z self = 2025-05-07T20:31:29.1496948Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:29.1499350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7e5be4d0>} 2025-05-07T20:31:29.1501821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:29.1503600Z context = 2025-05-07T20:31:29.1504081Z 2025-05-07T20:31:29.1504355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:29.1505229Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.1506019Z module_map=module_map) 2025-05-07T20:31:29.1506607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.1507169Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:29.1507595Z E ^ 2025-05-07T20:31:29.1508372Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.1509153Z 2025-05-07T20:31:29.1509880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.1510785Z 2025-05-07T20:31:29.1510960Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.1511648Z self=, 2025-05-07T20:31:29.1512331Z T=1, 2025-05-07T20:31:29.1512616Z D=7168, 2025-05-07T20:31:29.1512928Z scale_ub=None, 2025-05-07T20:31:29.1513275Z contiguous=True, 2025-05-07T20:31:29.1513624Z compiled=True, 2025-05-07T20:31:29.1513957Z ) 2025-05-07T20:31:29.1514485Z self = 2025-05-07T20:31:29.1515286Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:29.1548768Z > y_fp8_ref,
2025-05-07T20:31:29.1548768Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:29.1549110Z 
2025-05-07T20:31:29.1549273Z moe/activation_test.py:126: 
2025-05-07T20:31:29.1549779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:29.1550360Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:29.1550907Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:29.1552331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:29.1553693Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:29.1554611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:29.1555797Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:29.1556983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:29.1558270Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:29.1559552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:29.1561007Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:29.1562320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:29.1563460Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:29.1564515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:29.1565433Z     fn()
2025-05-07T20:31:29.1566327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:29.1567353Z     self.fn.run(
2025-05-07T20:31:29.1568172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:29.1569109Z     kernel = self.compile(
2025-05-07T20:31:29.1570059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:29.1571224Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:29.1571905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:29.1572303Z 
2025-05-07T20:31:29.1572660Z self = 
2025-05-07T20:31:29.1574567Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:29.1577050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cb1f90160>}
2025-05-07T20:31:29.1579464Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:29.1581408Z context = 
2025-05-07T20:31:29.1581910Z 
2025-05-07T20:31:29.1582180Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:29.1583044Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:29.1583821Z                            module_map=module_map)
2025-05-07T20:31:29.1584418Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:29.1585003Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:29.1585434Z E       ^
2025-05-07T20:31:29.1586197Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:29.1586993Z 
2025-05-07T20:31:29.1587740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:29.1588685Z 
2025-05-07T20:31:29.1588870Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:29.1589588Z     self=,
2025-05-07T20:31:29.1590590Z     T=4096,
2025-05-07T20:31:29.1590899Z     D=5120,
2025-05-07T20:31:29.1591212Z     scale_ub=None,
2025-05-07T20:31:29.1591554Z     contiguous=False,
2025-05-07T20:31:29.1591928Z     compiled=False,
2025-05-07T20:31:29.1592265Z )
2025-05-07T20:31:29.7126225Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:29.7128190Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last):
2025-05-07T20:31:29.7131074Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:29.7133922Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:29.7136473Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:29.7138894Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:29.7141351Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:29.7143783Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:29.7146274Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:29.7148502Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     generator.visit(fn.parse())
2025-05-07T20:31:29.7150704Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:29.7152888Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ret = super().visit(node)
2025-05-07T20:31:29.7154737Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:31:29.7156507Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return visitor(node)
2025-05-07T20:31:29.7158689Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:29.7160997Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:29.7163001Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:31:29.7164877Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     self.visit(item)
2025-05-07T20:31:29.7166987Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:29.7169423Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:29.7171306Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:29.7172917Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:29.7174486Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^
2025-05-07T20:31:29.7176295Z W0507 20:31:29.709000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:30.2414635Z W0507 20:31:30.238000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:30.9307663Z W0507 20:31:30.927000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:30.9620872Z W0507 20:31:30.958000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
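Every failure above is the same architectural mismatch rather than a logic bug: Triton's fp8e4nv is the e4m3 FP8 format (torch.float8_e4m3fn), and Triton's NVIDIA backend accepts it only on compute capability 8.9 or newer, while the A10G GPU in a g5.4xlarge reports capability 8.6 and therefore only offers 'fp8e4b15' and 'fp8e5'. Below is a minimal sketch of a capability gate a test suite could use to skip these cases on such GPUs; the helper name and the skip wiring are illustrative assumptions, not code from fbgemm_gpu.

    # Illustrative guard (assumed helper, not from the FBGEMM sources): skip
    # FP8 e4m3 tests on GPUs older than SM 8.9, where Triton rejects fp8e4nv.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # The A10G on this runner reports (8, 6); Triton's fp8e4nv needs (8, 9)+.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationFp8Tests(unittest.TestCase):
        ...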
2025-05-07T20:31:34.2477605Z self = 
2025-05-07T20:31:34.2478317Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:34.2491054Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:34.2491336Z moe/activation_test.py:117: 
2025-05-07T20:31:34.2506301Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.2506658Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.2506929Z E       ^
2025-05-07T20:31:34.2507386Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.2508268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:34.2508887Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:34.2509302Z     self=,
2025-05-07T20:31:34.2509694Z     T=4096,
2025-05-07T20:31:34.2509882Z     D=7168,
2025-05-07T20:31:34.2510080Z     scale_ub=None,
2025-05-07T20:31:34.2510292Z     contiguous=False,
2025-05-07T20:31:34.2510521Z     compiled=False,
2025-05-07T20:31:34.2510730Z )
2025-05-07T20:31:34.2511042Z self = 
2025-05-07T20:31:34.2511543Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:34.2523483Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:34.2523753Z moe/activation_test.py:117: 
2025-05-07T20:31:34.2537127Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.2537480Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.2537745Z E       ^
2025-05-07T20:31:34.2538208Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.2539066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:34.2539691Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:34.2540252Z     self=,
2025-05-07T20:31:34.2540833Z     T=128,
2025-05-07T20:31:34.2541098Z     D=7168,
2025-05-07T20:31:34.2541366Z     scale_ub=None,
2025-05-07T20:31:34.2541657Z     contiguous=False,
2025-05-07T20:31:34.2541975Z     compiled=True,
2025-05-07T20:31:34.2542257Z )
2025-05-07T20:31:34.3221788Z self = 
2025-05-07T20:31:34.3223159Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:31:34.3246407Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:34.3246725Z moe/activation_test.py:126: 
2025-05-07T20:31:34.3267308Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.3267675Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:34.3267951Z E       ^
2025-05-07T20:31:34.3268414Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.3269279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:34.3269903Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:34.3270314Z     self=,
2025-05-07T20:31:34.3270719Z     T=128,
2025-05-07T20:31:34.3270924Z     D=7168,
2025-05-07T20:31:34.3271131Z     scale_ub=None,
2025-05-07T20:31:34.3271353Z     contiguous=False,
2025-05-07T20:31:34.3271686Z     compiled=False,
2025-05-07T20:31:34.3271903Z )
2025-05-07T20:31:34.5376286Z self = 
2025-05-07T20:31:34.5377097Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:34.5389384Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:34.5389662Z moe/activation_test.py:117: 
2025-05-07T20:31:34.5403793Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.5404147Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.5404411Z E       ^
2025-05-07T20:31:34.5404879Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.5405736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:34.5406363Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:34.5406776Z     self=,
2025-05-07T20:31:34.5407181Z     T=4096,
2025-05-07T20:31:34.5407368Z     D=5120,
2025-05-07T20:31:34.5407571Z     scale_ub=1200.0,
2025-05-07T20:31:34.5407799Z     contiguous=True,
2025-05-07T20:31:34.5408022Z     compiled=False,
2025-05-07T20:31:34.5408237Z )
2025-05-07T20:31:34.5408564Z self = 
2025-05-07T20:31:34.5409053Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:34.5421095Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:34.5421369Z moe/activation_test.py:117: 
2025-05-07T20:31:34.5434772Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.5435126Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.5435383Z E       ^
2025-05-07T20:31:34.5435846Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.5436877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
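For reference, the ref_fn path above quantizes the SiLU-mul output row by row. Here is a minimal pure-PyTorch sketch of that row-wise FP8 quantization, inferred from how the test consumes the result (y_fp8.to(torch.float32) * y_scale[:, None]) rather than taken from triton_quantize_fp8_row itself; fbgemm_gpu's kernel may differ in details such as epsilon handling and the exact scale_ub semantics.

    # Sketch under stated assumptions: per-row scales sized so each row fits
    # the e4m3 range, with an optional upper bound on the row maximum.
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX               # per-row dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale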
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.5436385Z 2025-05-07T20:31:34.5436877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:34.5437396Z 2025-05-07T20:31:34.5437502Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:34.5437916Z self=, 2025-05-07T20:31:34.5438318Z T=1, 2025-05-07T20:31:34.5438508Z D=5120, 2025-05-07T20:31:34.5438702Z scale_ub=None, 2025-05-07T20:31:34.5438927Z contiguous=True, 2025-05-07T20:31:34.5439155Z compiled=True, 2025-05-07T20:31:34.5439360Z ) 2025-05-07T20:31:35.0182215Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:35.0183584Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:35.0184971Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:35.0186405Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:35.0187787Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:35.0189181Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.0190793Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:35.0192178Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.0193578Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:35.0194817Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:35.0196035Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:35.0197242Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:35.0198273Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:35.0199281Z W0507 20:31:35.014000 86845 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:35.0200493Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:35.0202137Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:35.0203253Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:35.0204281Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:35.0205451Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:35.0206800Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:35.0207869Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.0208781Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.0209514Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:35.0210529Z W0507 20:31:35.014000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
[... identical identify_mutated_tensors warning + CompilationError traceback repeated 3 more times for frame [1/4] (20:31:35.178, 20:31:35.628, 20:31:35.657) ...]
2025-05-07T20:31:35.9690748Z self = <...>
2025-05-07T20:31:35.9691488Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:35.9691866Z 
2025-05-07T20:31:35.9692022Z @given(
2025-05-07T20:31:35.9692351Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:35.9692790Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:35.9693256Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:35.9693707Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:35.9694048Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:31:35.9694333Z )
2025-05-07T20:31:35.9702246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:35.9702834Z def test_silu_mul_quant(
2025-05-07T20:31:35.9703093Z     self,
2025-05-07T20:31:35.9703292Z     T: int,
2025-05-07T20:31:35.9703501Z     D: int,
2025-05-07T20:31:35.9703732Z     scale_ub: Optional[float],
2025-05-07T20:31:35.9704007Z     contiguous: bool,
2025-05-07T20:31:35.9704260Z     compiled: bool,
2025-05-07T20:31:35.9704495Z ) -> None:
2025-05-07T20:31:35.9704724Z     torch.manual_seed(2025)
2025-05-07T20:31:35.9704973Z 
2025-05-07T20:31:35.9705257Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:35.9705606Z 
2025-05-07T20:31:35.9705800Z     x_sign = torch.sign(x)
2025-05-07T20:31:35.9706117Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:35.9706433Z     x = x_sign * x_clamp
2025-05-07T20:31:35.9706684Z     x0 = x[:, :D]
2025-05-07T20:31:35.9706911Z     x1 = x[:, D:]
2025-05-07T20:31:35.9707127Z 
2025-05-07T20:31:35.9707318Z     if contiguous:
2025-05-07T20:31:35.9707559Z         x0 = x0.contiguous()
2025-05-07T20:31:35.9707824Z         x1 = x1.contiguous()
2025-05-07T20:31:35.9708063Z 
2025-05-07T20:31:35.9708263Z     if scale_ub is not None:
2025-05-07T20:31:35.9708542Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:31:35.9708875Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:35.9709189Z         )
2025-05-07T20:31:35.9709389Z     else:
2025-05-07T20:31:35.9709602Z         scale_ub_tensor = None
2025-05-07T20:31:35.9709863Z 
2025-05-07T20:31:35.9710105Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:35.9710425Z         op = silu_mul_quant
2025-05-07T20:31:35.9710680Z         if compiled:
2025-05-07T20:31:35.9710937Z             op = torch.compile(op)
2025-05-07T20:31:35.9711253Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:35.9711529Z 
2025-05-07T20:31:35.9711735Z     y_fp8, y_scale = fn()
2025-05-07T20:31:35.9712027Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:35.9712314Z 
2025-05-07T20:31:35.9712559Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:35.9712902Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:35.9713193Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:35.9713511Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:35.9713872Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:35.9714187Z 
2025-05-07T20:31:35.9714392Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:35.9714595Z 
2025-05-07T20:31:35.9714699Z moe/activation_test.py:126: 
2025-05-07T20:31:35.9715353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:35.9715819Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:35.9716163Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:35.9716958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:35.9717716Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:35.9718266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:35.9718957Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:35.9719647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:35.9720365Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:35.9721130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:35.9721886Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:35.9722616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:35.9723264Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:35.9723906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:35.9724438Z     fn()
2025-05-07T20:31:35.9724960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:35.9725537Z     self.fn.run(
2025-05-07T20:31:35.9726011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:35.9726559Z     kernel = self.compile(
2025-05-07T20:31:35.9727101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:35.9727757Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:35.9728156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:35.9728408Z 
2025-05-07T20:31:35.9728630Z self = <...>
2025-05-07T20:31:35.9729707Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:35.9731102Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f1c7e5bea70>}
2025-05-07T20:31:35.9732445Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:31:35.9733461Z context = <...>
2025-05-07T20:31:35.9733754Z 
2025-05-07T20:31:35.9733924Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:35.9734449Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:35.9734914Z                            module_map=module_map)
2025-05-07T20:31:35.9735281Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:35.9735641Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:35.9735909Z E       ^
2025-05-07T20:31:35.9736369Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:35.9736918Z 
2025-05-07T20:31:35.9737410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
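The repeated failure above is an architecture mismatch rather than a logic bug: Triton's fp8e4nv maps to torch.float8_e4m3fn, which the NVIDIA backend only compiles natively on compute capability 8.9 and newer (Ada/Hopper); on older parts such as the A10G (sm_86), only the 'fp8e4b15' and 'fp8e5' encodings are available, hence the ValueError. A minimal sketch of a capability guard a test like this could use (the helper name and skip message are illustrative, not taken from the test file):

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ on NVIDIA GPUs; earlier
    # architectures raise "type fp8e4nv not supported in this architecture".
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Example usage on the test above:
# @unittest.skipUnless(_supports_fp8e4nv(), "FP8 e4m3 needs SM 8.9+ (Ada/Hopper)")
# def test_silu_mul_quant(self, ...) -> None: ...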
2025-05-07T20:31:35.9738032Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.9738445Z     self=<...>,
2025-05-07T20:31:35.9738841Z     T=2048,
2025-05-07T20:31:35.9739035Z     D=5120,
2025-05-07T20:31:35.9739232Z     scale_ub=None,
2025-05-07T20:31:35.9739444Z     contiguous=True,
2025-05-07T20:31:35.9739672Z     compiled=True,
2025-05-07T20:31:35.9739969Z )
[... identical identify_mutated_tensors warning + CompilationError traceback repeated 4 times for frame [1/5] (20:31:36.412, 20:31:36.575, 20:31:37.019, 20:31:37.049) ...]
2025-05-07T20:31:37.5098498Z self = <...>
2025-05-07T20:31:37.5099270Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source listing, failure traceback, and CompilationError from _kernel_quantize_fp8_row identical to the T=1 example above, apart from object addresses ...]
2025-05-07T20:31:37.5137153Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:37.5137646Z     self=<...>,
2025-05-07T20:31:37.5138050Z     T=128,
2025-05-07T20:31:37.5138232Z     D=5120,
2025-05-07T20:31:37.5138431Z     scale_ub=None,
2025-05-07T20:31:37.5138655Z     contiguous=True,
2025-05-07T20:31:37.5138874Z     compiled=True,
2025-05-07T20:31:37.5139087Z )
[... identical identify_mutated_tensors warning + CompilationError traceback repeated for frame [1/6] (20:31:37.979, 20:31:38.143, 20:31:38.592) ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.6259458Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.6260636Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:38.6261956Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.6264182Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.6265565Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.6266937Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.6268234Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.6269598Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.6271007Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.6272235Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:38.6273433Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.6274625Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:38.6275652Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:38.6276664Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:38.6277872Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.6279134Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.6280234Z W0507 
20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:38.6281260Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:38.6282421Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.6283756Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.6284813Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.6285712Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.6286448Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:38.6287601Z W0507 20:31:38.622000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.0383924Z self = 2025-05-07T20:31:39.0384531Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:39.0384805Z 2025-05-07T20:31:39.0384885Z @given( 2025-05-07T20:31:39.0385132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:39.0385447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:39.0385766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:39.0386105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:39.0386440Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:39.0386722Z ) 2025-05-07T20:31:39.0387079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:39.0387559Z def test_silu_mul_quant( 2025-05-07T20:31:39.0387812Z self, 2025-05-07T20:31:39.0388026Z T: int, 2025-05-07T20:31:39.0388232Z D: int, 2025-05-07T20:31:39.0388455Z scale_ub: Optional[float], 2025-05-07T20:31:39.0388736Z contiguous: bool, 2025-05-07T20:31:39.0388983Z compiled: bool, 2025-05-07T20:31:39.0389213Z ) -> None: 2025-05-07T20:31:39.0389439Z torch.manual_seed(2025) 2025-05-07T20:31:39.0389688Z 2025-05-07T20:31:39.0390249Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:39.0390604Z 2025-05-07T20:31:39.0390808Z x_sign = torch.sign(x) 2025-05-07T20:31:39.0391100Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:39.0391428Z x = x_sign * x_clamp 2025-05-07T20:31:39.0391688Z x0 = x[:, :D] 2025-05-07T20:31:39.0391914Z x1 = x[:, D:] 2025-05-07T20:31:39.0392139Z 2025-05-07T20:31:39.0392326Z if contiguous: 2025-05-07T20:31:39.0392568Z x0 = x0.contiguous() 2025-05-07T20:31:39.0392842Z x1 = x1.contiguous() 2025-05-07T20:31:39.0393082Z 2025-05-07T20:31:39.0393283Z if scale_ub is not None: 2025-05-07T20:31:39.0393565Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:39.0393900Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:39.0394244Z ) 2025-05-07T20:31:39.0394467Z else: 2025-05-07T20:31:39.0394679Z scale_ub_tensor = None 
2025-05-07T20:31:39.0394935Z 2025-05-07T20:31:39.0395175Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.0395491Z op = silu_mul_quant 2025-05-07T20:31:39.0395752Z if compiled: 2025-05-07T20:31:39.0396011Z op = torch.compile(op) 2025-05-07T20:31:39.0396311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:39.0396596Z 2025-05-07T20:31:39.0396797Z y_fp8, y_scale = fn() 2025-05-07T20:31:39.0397097Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:39.0397390Z 2025-05-07T20:31:39.0397635Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:39.0397979Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:39.0398275Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:39.0398596Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:39.0398958Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:39.0399268Z 2025-05-07T20:31:39.0399473Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:39.0399667Z 2025-05-07T20:31:39.0399778Z moe/activation_test.py:126: 2025-05-07T20:31:39.0400085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.0400418Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:39.0400755Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:39.0402072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:39.0402824Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:39.0403374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:39.0404056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:39.0404742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:39.0405457Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:39.0406212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:39.0406962Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:39.0407701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:39.0408337Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:39.0408937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:39.0409466Z fn() 2025-05-07T20:31:39.0409971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:39.0410560Z self.fn.run( 2025-05-07T20:31:39.0411029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:39.0411562Z kernel = self.compile( 2025-05-07T20:31:39.0412097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:39.0412758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.0413165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:39.0413394Z 2025-05-07T20:31:39.0413603Z self = 2025-05-07T20:31:39.0414684Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:39.0416079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c564dcb80>} 2025-05-07T20:31:39.0417409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:39.0418436Z context = 2025-05-07T20:31:39.0418724Z 2025-05-07T20:31:39.0418898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:39.0419421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.0420034Z module_map=module_map) 2025-05-07T20:31:39.0420407Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.0420759Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:39.0421031Z E ^ 2025-05-07T20:31:39.0421496Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:39.0421939Z 2025-05-07T20:31:39.0422351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:39.0422874Z 2025-05-07T20:31:39.0423077Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:39.0423495Z self=, 2025-05-07T20:31:39.0424013Z T=4096, 2025-05-07T20:31:39.0424229Z D=5120, 2025-05-07T20:31:39.0424459Z scale_ub=None, 2025-05-07T20:31:39.0424684Z contiguous=True, 2025-05-07T20:31:39.0424907Z compiled=True, 2025-05-07T20:31:39.0425125Z ) 2025-05-07T20:31:39.5157767Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:39.5158841Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:39.5160175Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:39.5161625Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:39.5163001Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:39.5164381Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:39.5165680Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:39.5167050Z W0507 20:31:39.512000 86845 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:39.5168463Z W0507 20:31:39.512000 86845 
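Every failure in this run is the same Triton error: the fp8e4nv (e4m3) dtype is only exposed on NVIDIA GPUs with compute capability 8.9 or newer, and the GPU on this runner only offers fp8e4b15 and fp8e5. A minimal, hypothetical guard (not part of the test file above) that would skip these cases on older devices:

import unittest
import torch

def has_fp8e4nv_support() -> bool:
    # Triton's fp8e4nv requires an SM 8.9+ (Ada/Hopper-class) CUDA device.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(has_fp8e4nv_support(), "fp8e4nv unsupported on this GPU")
def test_silu_mul_quant_fp8_gated() -> None:
    ...  # the fp8 test body would go here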
2025-05-07T20:31:40.5624069Z self = 
2025-05-07T20:31:40.5624731Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:40.5660325Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:40.5660689Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:40.5660955Z E       ^
2025-05-07T20:31:40.5661414Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:40.5661864Z 
2025-05-07T20:31:40.5662279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:40.5662788Z 
2025-05-07T20:31:40.5662899Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:40.5663314Z     self=,
2025-05-07T20:31:40.5663812Z     T=16384,
2025-05-07T20:31:40.5664013Z     D=5120,
2025-05-07T20:31:40.5664211Z     scale_ub=None,
2025-05-07T20:31:40.5664499Z     contiguous=True,
2025-05-07T20:31:40.5664732Z     compiled=True,
2025-05-07T20:31:40.5664940Z )
2025-05-07T20:31:40.6076844Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:40.6078106Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:40.6079471Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:40.6080457Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:40.6081584Z W0507 20:31:40.606000 86845 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
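The recompile-limit warning above is distinct from the fp8 failures: each hypothesis example changes T and the layout of the sliced x0/x1 inputs (contiguous vs. strided), so torch.compile's guards fail and a fresh graph is compiled until config.recompile_limit (8) is hit, after which dynamo falls back to eager. A sketch of two possible mitigations, assuming only the config knob named in the warning:

import torch
import torch._dynamo as dynamo

# Option 1: allow more recompiles across the parameter sweep.
dynamo.config.recompile_limit = 64  # default is 8 per the warning

# Option 2: drop cached graphs before each new example instead.
dynamo.reset()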
2025-05-07T20:31:40.7102144Z self = 2025-05-07T20:31:40.7102907Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:40.7103294Z 2025-05-07T20:31:40.7103397Z @given( 2025-05-07T20:31:40.7103642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:40.7103965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:40.7104275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:40.7104617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:40.7104957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:40.7105248Z ) 2025-05-07T20:31:40.7105636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:40.7106082Z def test_silu_mul_quant( 2025-05-07T20:31:40.7106346Z self, 2025-05-07T20:31:40.7106547Z T: int, 2025-05-07T20:31:40.7106775Z D: int, 2025-05-07T20:31:40.7107010Z scale_ub: Optional[float], 2025-05-07T20:31:40.7107288Z contiguous: bool, 2025-05-07T20:31:40.7107531Z compiled: bool, 2025-05-07T20:31:40.7121225Z ) -> None: 2025-05-07T20:31:40.7121570Z torch.manual_seed(2025) 2025-05-07T20:31:40.7121931Z 2025-05-07T20:31:40.7122304Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:40.7122693Z 2025-05-07T20:31:40.7122926Z x_sign = torch.sign(x) 2025-05-07T20:31:40.7123235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:40.7123564Z x = x_sign * x_clamp 2025-05-07T20:31:40.7123818Z x0 = x[:, :D] 2025-05-07T20:31:40.7124053Z x1 = x[:, D:] 2025-05-07T20:31:40.7124320Z 2025-05-07T20:31:40.7124526Z if contiguous: 2025-05-07T20:31:40.7124812Z x0 = x0.contiguous() 2025-05-07T20:31:40.7125138Z x1 = x1.contiguous() 2025-05-07T20:31:40.7125399Z 2025-05-07T20:31:40.7125601Z if scale_ub is not None: 2025-05-07T20:31:40.7125892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:40.7126245Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:40.7126563Z ) 2025-05-07T20:31:40.7126804Z else: 2025-05-07T20:31:40.7127036Z scale_ub_tensor = None 2025-05-07T20:31:40.7127295Z 2025-05-07T20:31:40.7127548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:40.7127881Z op = silu_mul_quant 2025-05-07T20:31:40.7128153Z if compiled: 2025-05-07T20:31:40.7128414Z op = torch.compile(op) 2025-05-07T20:31:40.7128729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:40.7129387Z 2025-05-07T20:31:40.7129590Z y_fp8, y_scale = fn() 2025-05-07T20:31:40.7129898Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:40.7130365Z 2025-05-07T20:31:40.7130617Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:40.7130970Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:40.7131284Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:40.7131612Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:40.7131988Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:40.7132309Z 2025-05-07T20:31:40.7132519Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:40.7132726Z 2025-05-07T20:31:40.7132838Z moe/activation_test.py:126: 2025-05-07T20:31:40.7133152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.7133501Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:40.7133835Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:40.7134648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:40.7135415Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:40.7135973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:40.7136728Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:40.7137539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:40.7142906Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:40.7143725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:40.7165249Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:40.7166012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:40.7166655Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:40.7167244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:40.7167756Z fn() 2025-05-07T20:31:40.7168260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:40.7168835Z self.fn.run( 2025-05-07T20:31:40.7169302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:40.7169836Z kernel = self.compile( 2025-05-07T20:31:40.7170370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:40.7171019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.7171419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:40.7171644Z 2025-05-07T20:31:40.7171854Z self = 2025-05-07T20:31:40.7172929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:40.7174295Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c5663a3b0>} 2025-05-07T20:31:40.7175621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:40.7176790Z context = 2025-05-07T20:31:40.7177077Z 2025-05-07T20:31:40.7177328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:40.7177850Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.7178313Z module_map=module_map) 2025-05-07T20:31:40.7178676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.7179035Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:40.7179301Z E ^ 2025-05-07T20:31:40.7179760Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.7180262Z 2025-05-07T20:31:40.7180680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:40.7181195Z 2025-05-07T20:31:40.7181310Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.7181716Z self=, 2025-05-07T20:31:40.7182123Z T=1, 2025-05-07T20:31:40.7182309Z D=5120, 2025-05-07T20:31:40.7182509Z scale_ub=1200.0, 2025-05-07T20:31:40.7182728Z contiguous=True, 2025-05-07T20:31:40.7182947Z compiled=True, 2025-05-07T20:31:40.7183150Z ) 2025-05-07T20:31:41.0648002Z self = 2025-05-07T20:31:41.0648742Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.0649111Z 2025-05-07T20:31:41.0649200Z @given( 2025-05-07T20:31:41.0649442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.0649764Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.0650078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.0650415Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.0650782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.0651076Z ) 2025-05-07T20:31:41.0651449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.0651899Z def test_silu_mul_quant( 2025-05-07T20:31:41.0652144Z self, 2025-05-07T20:31:41.0652350Z T: int, 2025-05-07T20:31:41.0652556Z D: int, 2025-05-07T20:31:41.0652781Z scale_ub: Optional[float], 2025-05-07T20:31:41.0653065Z contiguous: bool, 2025-05-07T20:31:41.0653314Z compiled: bool, 2025-05-07T20:31:41.0653546Z ) -> None: 2025-05-07T20:31:41.0653776Z torch.manual_seed(2025) 2025-05-07T20:31:41.0654035Z 2025-05-07T20:31:41.0654319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.0654694Z 2025-05-07T20:31:41.0654937Z x_sign = torch.sign(x) 2025-05-07T20:31:41.0655233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.0655553Z x = x_sign * x_clamp 2025-05-07T20:31:41.0655816Z x0 = x[:, :D] 2025-05-07T20:31:41.0656039Z x1 = x[:, D:] 2025-05-07T20:31:41.0656262Z 2025-05-07T20:31:41.0656469Z if contiguous: 2025-05-07T20:31:41.0656717Z x0 = x0.contiguous() 2025-05-07T20:31:41.0656985Z x1 = x1.contiguous() 2025-05-07T20:31:41.0657223Z 2025-05-07T20:31:41.0657421Z if scale_ub is not None: 2025-05-07T20:31:41.0657701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.0658037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.0658349Z ) 2025-05-07T20:31:41.0658550Z else: 2025-05-07T20:31:41.0658763Z scale_ub_tensor = None 2025-05-07T20:31:41.0659020Z 2025-05-07T20:31:41.0659261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.0659573Z op = silu_mul_quant 2025-05-07T20:31:41.0659940Z if compiled: 2025-05-07T20:31:41.0660201Z op = torch.compile(op) 2025-05-07T20:31:41.0660873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0661143Z 2025-05-07T20:31:41.0661471Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.0661642Z 2025-05-07T20:31:41.0661755Z moe/activation_test.py:117: 2025-05-07T20:31:41.0662054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0662393Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.0662682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0663242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.0663815Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f1c5524feb0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

[@given parameters and test body identical to the example above, up to and including fn(); here fn() succeeds and the failure moves to the reference path:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[autotuner and compiler frames identical to the first traceback above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
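Every failure in this run is the same root cause: Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) cannot be lowered for this GPU's architecture, and per the ValueError only 'fp8e4b15' and 'fp8e5' are available here. As of the Triton version in this environment, fp8e4nv is only enabled on compute capability 8.9 and newer. One way to keep such a job green on older runners is a capability gate in the test; the helper below is a sketch under that assumption (the function name and its placement are not part of the original test):

import pytest
import torch

def require_fp8e4nv() -> None:
    # fp8e4nv (e4m3) compiles in Triton only on SM 8.9+ GPUs; skip elsewhere.
    if not torch.cuda.is_available():
        pytest.skip("CUDA required")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 9):
        pytest.skip("fp8e4nv (torch.float8_e4m3fn) not supported on this GPU")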
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[test body identical to the first example; fails at moe/activation_test.py:117 in fn() -> activation.py:80 in silu_mul_quant -> _fbgemm_silu_mul_quant; same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
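Note that the reference path fails for the same reason as the op under test: ref_fn quantizes through triton_quantize_fp8_row, which compiles an fp8e4nv Triton kernel of its own, so even the "pure PyTorch" reference cannot run on this machine. A torch-only rowwise quantization would sidestep Triton entirely; the sketch below assumes the usual rowwise scheme (per-row scale = max_abs / fp8_max, dequantization by multiplying with the scale, scale_ub clamping the row max) and is an illustration, not FBGEMM's implementation:

import torch

def quantize_fp8_row_torch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Rowwise scale: max_abs(row) / fp8_max, optionally clamped by scale_ub.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1, keepdim=True).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / fp8_max).clamp(min=1e-12)
    y_fp8 = (y.float() / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)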
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[identical failure at moe/activation_test.py:117 in fn(), via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[identical failure at moe/activation_test.py:117 in fn() -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
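The parameters hypothesis draws (T, D, scale_ub, contiguous, compiled) never change the outcome, because the exception is raised while Triton lowers the kernel AST to TTIR, before any launch grid or tensor is consulted. A minimal repro independent of FBGEMM would be any jitted kernel that materializes the dtype; the kernel below is a hypothetical sketch, assuming this environment's triton.language exposes float8e4nv:

import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    # The .to(tl.float8e4nv) below is what trips the ValueError on SM < 8.9.
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)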
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[identical failure at moe/activation_test.py:117 in fn() -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[identical failure at moe/activation_test.py:117 in fn() -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[identical failure at moe/activation_test.py:117 in fn(), via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure at moe/activation_test.py:117 in fn(), via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[as in the earlier scale_ub=None, compiled=True example: fn() succeeds, then ref_fn() fails at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row; same CompilationError/ValueError as above]
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure at moe/activation_test.py:117 in fn(), via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant; same CompilationError/ValueError as above]
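The error message itself lists 'fp8e5' (e5m2) as compilable on this architecture, so another option is to fall back to torch.float8_e5m2 on pre-8.9 GPUs. Whether the FBGEMM GenAI kernels accept e5m2 inputs is not established by this log, so the selector below is only a sketch of the dtype choice:

import torch

def pick_fp8_dtype() -> torch.dtype:
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn   # Triton 'fp8e4nv'
    return torch.float8_e5m2         # Triton 'fp8e5', listed as supported here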
2025-05-07T20:31:42.2637316Z op = torch.compile(op) 2025-05-07T20:31:42.2637627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.2637910Z 2025-05-07T20:31:42.2638105Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.2638280Z 2025-05-07T20:31:42.2638385Z moe/activation_test.py:117: 2025-05-07T20:31:42.2638692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.2639386Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.2639808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.2640384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.2640953Z return fn(*args, **kwargs) 2025-05-07T20:31:42.2641616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.2642314Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.2642862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.2643547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.2644219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.2644769Z kernel = self.compile( 2025-05-07T20:31:42.2645375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.2646040Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.2646448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.2646683Z 2025-05-07T20:31:42.2646904Z self = 2025-05-07T20:31:42.2647986Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.2649384Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55076170>} 2025-05-07T20:31:42.2650741Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.2651771Z context = 2025-05-07T20:31:42.2652063Z 2025-05-07T20:31:42.2652241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.2652766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.2653242Z module_map=module_map) 2025-05-07T20:31:42.2653618Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.2653980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.2654243Z E ^ 2025-05-07T20:31:42.2654712Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.2655163Z 2025-05-07T20:31:42.2655594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.2656116Z 2025-05-07T20:31:42.2656230Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.2656640Z self=, 2025-05-07T20:31:42.2657045Z T=1, 2025-05-07T20:31:42.2657237Z D=5120, 2025-05-07T20:31:42.2657431Z scale_ub=1200.0, 2025-05-07T20:31:42.2657681Z contiguous=False, 2025-05-07T20:31:42.2657915Z compiled=False, 2025-05-07T20:31:42.2658130Z ) 2025-05-07T20:31:42.2658446Z self = 2025-05-07T20:31:42.2658939Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:42.2659208Z 2025-05-07T20:31:42.2659296Z @given( 2025-05-07T20:31:42.2659525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.2659944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.2660350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.2660751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.2661093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.2661385Z ) 2025-05-07T20:31:42.2661744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.2662181Z def test_silu_mul_quant( 2025-05-07T20:31:42.2662429Z self, 2025-05-07T20:31:42.2662630Z T: int, 2025-05-07T20:31:42.2662828Z D: int, 2025-05-07T20:31:42.2663056Z scale_ub: Optional[float], 2025-05-07T20:31:42.2663334Z contiguous: bool, 2025-05-07T20:31:42.2663574Z compiled: bool, 2025-05-07T20:31:42.2663809Z ) -> None: 2025-05-07T20:31:42.2664035Z torch.manual_seed(2025) 2025-05-07T20:31:42.2664283Z 2025-05-07T20:31:42.2664564Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.2664922Z 2025-05-07T20:31:42.2665124Z x_sign = torch.sign(x) 2025-05-07T20:31:42.2665471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.2665788Z x = x_sign * x_clamp 2025-05-07T20:31:42.2666035Z x0 = x[:, :D] 2025-05-07T20:31:42.2666253Z x1 = x[:, D:] 2025-05-07T20:31:42.2666469Z 2025-05-07T20:31:42.2666667Z if contiguous: 2025-05-07T20:31:42.2666901Z x0 = x0.contiguous() 2025-05-07T20:31:42.2667165Z x1 = x1.contiguous() 2025-05-07T20:31:42.2667409Z 2025-05-07T20:31:42.2667602Z if scale_ub is not None: 2025-05-07T20:31:42.2667881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.2668218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.2668522Z ) 2025-05-07T20:31:42.2668721Z else: 2025-05-07T20:31:42.2668942Z scale_ub_tensor = None 2025-05-07T20:31:42.2669195Z 2025-05-07T20:31:42.2669432Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.2669753Z op = silu_mul_quant 2025-05-07T20:31:42.2670012Z if compiled: 2025-05-07T20:31:42.2670267Z op = torch.compile(op) 2025-05-07T20:31:42.2670569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.2670841Z 2025-05-07T20:31:42.2671041Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.2671214Z 2025-05-07T20:31:42.2671316Z moe/activation_test.py:117: 2025-05-07T20:31:42.2671619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.2671950Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.2672236Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.2672923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.2673613Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.2674158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.2674862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.2675578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.2676108Z kernel = self.compile( 2025-05-07T20:31:42.2676652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.2677314Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.2677706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.2677943Z 2025-05-07T20:31:42.2678150Z self = 2025-05-07T20:31:42.2679228Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.2680759Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55075e10>} 2025-05-07T20:31:42.2682115Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.2683130Z context = 2025-05-07T20:31:42.2683427Z 2025-05-07T20:31:42.2683595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.2684125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.2684594Z module_map=module_map) 2025-05-07T20:31:42.2684969Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.2685330Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.2685607Z E ^ 2025-05-07T20:31:42.2686075Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.2686726Z 2025-05-07T20:31:42.2687146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.2687663Z 2025-05-07T20:31:42.2687770Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.2688188Z self=, 2025-05-07T20:31:42.2688590Z T=16384, 2025-05-07T20:31:42.2688789Z D=5120, 2025-05-07T20:31:42.2688992Z scale_ub=1200.0, 2025-05-07T20:31:42.2689218Z contiguous=False, 2025-05-07T20:31:42.2689448Z compiled=True, 2025-05-07T20:31:42.2689675Z ) 2025-05-07T20:31:42.3691836Z self = 2025-05-07T20:31:42.3692957Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:42.3693574Z 2025-05-07T20:31:42.3693736Z @given( 2025-05-07T20:31:42.3694205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.3694684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.3703433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.3703846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.3704209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.3704507Z ) 2025-05-07T20:31:42.3704879Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.3705336Z def test_silu_mul_quant( 2025-05-07T20:31:42.3705589Z self, 2025-05-07T20:31:42.3705803Z T: int, 2025-05-07T20:31:42.3706019Z D: int, 2025-05-07T20:31:42.3706250Z scale_ub: Optional[float], 2025-05-07T20:31:42.3706555Z contiguous: bool, 2025-05-07T20:31:42.3706811Z compiled: bool, 2025-05-07T20:31:42.3707092Z ) -> None: 2025-05-07T20:31:42.3707331Z torch.manual_seed(2025) 2025-05-07T20:31:42.3707586Z 2025-05-07T20:31:42.3707877Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.3708235Z 2025-05-07T20:31:42.3708438Z x_sign = torch.sign(x) 2025-05-07T20:31:42.3708740Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.3709054Z x = x_sign * x_clamp 2025-05-07T20:31:42.3709312Z x0 = x[:, :D] 2025-05-07T20:31:42.3709536Z x1 = x[:, D:] 2025-05-07T20:31:42.3709760Z 2025-05-07T20:31:42.3709959Z if contiguous: 2025-05-07T20:31:42.3710199Z x0 = x0.contiguous() 2025-05-07T20:31:42.3710474Z x1 = x1.contiguous() 2025-05-07T20:31:42.3710727Z 2025-05-07T20:31:42.3710924Z if scale_ub is not None: 2025-05-07T20:31:42.3711210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.3711928Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.3712372Z ) 2025-05-07T20:31:42.3712587Z else: 2025-05-07T20:31:42.3712818Z scale_ub_tensor = None 2025-05-07T20:31:42.3713078Z 2025-05-07T20:31:42.3713326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.3713659Z op = silu_mul_quant 2025-05-07T20:31:42.3713918Z if compiled: 2025-05-07T20:31:42.3714178Z op = torch.compile(op) 2025-05-07T20:31:42.3714485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.3714770Z 2025-05-07T20:31:42.3714972Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.3715152Z 2025-05-07T20:31:42.3715259Z moe/activation_test.py:117: 2025-05-07T20:31:42.3715566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.3715908Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.3716208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.3716785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.3717350Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.3718021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.3718718Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.3719270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.3719955Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.3720635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.3721181Z kernel = self.compile( 2025-05-07T20:31:42.3721728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.3722401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.3722820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.3723054Z 2025-05-07T20:31:42.3723277Z self = 2025-05-07T20:31:42.3724357Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.3725764Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c550743a0>} 2025-05-07T20:31:42.3727119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.3728164Z context = 2025-05-07T20:31:42.3728456Z 2025-05-07T20:31:42.3728636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.3729167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.3729646Z module_map=module_map) 2025-05-07T20:31:42.3730033Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.3730392Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.3730663Z E ^ 2025-05-07T20:31:42.3731142Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.3731595Z 2025-05-07T20:31:42.3732030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.3732641Z 2025-05-07T20:31:42.3732752Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.3733242Z self=, 2025-05-07T20:31:42.3733656Z T=2048, 2025-05-07T20:31:42.3733851Z D=7168, 2025-05-07T20:31:42.3734051Z scale_ub=1200.0, 2025-05-07T20:31:42.3734272Z contiguous=False, 2025-05-07T20:31:42.3734505Z compiled=True, 2025-05-07T20:31:42.3734717Z ) 2025-05-07T20:31:42.3735035Z self = 2025-05-07T20:31:42.3735536Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:42.3735818Z 2025-05-07T20:31:42.3735902Z @given( 2025-05-07T20:31:42.3736145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.3736457Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.3736774Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.3737120Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.3737451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.3737751Z ) 2025-05-07T20:31:42.3738109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.3738546Z def test_silu_mul_quant( 2025-05-07T20:31:42.3738793Z self, 2025-05-07T20:31:42.3739001Z T: int, 2025-05-07T20:31:42.3739201Z D: int, 2025-05-07T20:31:42.3739432Z scale_ub: Optional[float], 2025-05-07T20:31:42.3739710Z contiguous: bool, 2025-05-07T20:31:42.3740076Z compiled: bool, 2025-05-07T20:31:42.3740300Z ) -> None: 2025-05-07T20:31:42.3740524Z torch.manual_seed(2025) 2025-05-07T20:31:42.3740771Z 2025-05-07T20:31:42.3741045Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.3741395Z 2025-05-07T20:31:42.3741593Z x_sign = torch.sign(x) 2025-05-07T20:31:42.3741888Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.3742208Z x = x_sign * x_clamp 2025-05-07T20:31:42.3742452Z x0 = x[:, :D] 2025-05-07T20:31:42.3742674Z x1 = x[:, D:] 2025-05-07T20:31:42.3742889Z 2025-05-07T20:31:42.3743082Z if contiguous: 2025-05-07T20:31:42.3743316Z x0 = x0.contiguous() 2025-05-07T20:31:42.3743577Z x1 = x1.contiguous() 2025-05-07T20:31:42.3743820Z 2025-05-07T20:31:42.3744015Z if scale_ub is not None: 2025-05-07T20:31:42.3744296Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.3744637Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.3744950Z ) 2025-05-07T20:31:42.3745150Z else: 2025-05-07T20:31:42.3745372Z scale_ub_tensor = None 2025-05-07T20:31:42.3745635Z 2025-05-07T20:31:42.3745870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.3746195Z op = silu_mul_quant 2025-05-07T20:31:42.3746461Z if compiled: 2025-05-07T20:31:42.3746714Z op = torch.compile(op) 2025-05-07T20:31:42.3747021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.3747299Z 2025-05-07T20:31:42.3747497Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.3747670Z 2025-05-07T20:31:42.3747772Z moe/activation_test.py:117: 2025-05-07T20:31:42.3748082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.3748414Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.3748702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.3749266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.3749831Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.3750489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.3751177Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.3751808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.3752558Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.3753228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.3753761Z kernel = self.compile( 2025-05-07T20:31:42.3754305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.3754964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.3755419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.3755650Z 2025-05-07T20:31:42.3755866Z self = 2025-05-07T20:31:42.3756944Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.3758310Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55075fc0>} 2025-05-07T20:31:42.3759647Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.3760687Z context = 2025-05-07T20:31:42.3760975Z 2025-05-07T20:31:42.3761151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.3761670Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.3762144Z module_map=module_map) 2025-05-07T20:31:42.3762514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.3762878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.3763139Z E ^ 2025-05-07T20:31:42.3763616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.3764064Z 2025-05-07T20:31:42.3764484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.3764994Z 2025-05-07T20:31:42.5040344Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.5040996Z self=, 2025-05-07T20:31:42.5041575Z T=1, 2025-05-07T20:31:42.5041839Z D=5120, 2025-05-07T20:31:42.5042109Z scale_ub=None, 2025-05-07T20:31:42.5042343Z contiguous=False, 2025-05-07T20:31:42.5042588Z compiled=False, 2025-05-07T20:31:42.5042816Z ) 2025-05-07T20:31:42.5043148Z self = 2025-05-07T20:31:42.5043651Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:42.5043912Z 2025-05-07T20:31:42.5044001Z @given( 2025-05-07T20:31:42.5044236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.5044557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.5044867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.5045200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.5045649Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.5045949Z ) 2025-05-07T20:31:42.5046306Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.5046841Z def test_silu_mul_quant( 2025-05-07T20:31:42.5047129Z self, 2025-05-07T20:31:42.5047357Z T: int, 2025-05-07T20:31:42.5047562Z D: int, 2025-05-07T20:31:42.5048006Z scale_ub: Optional[float], 2025-05-07T20:31:42.5048288Z contiguous: bool, 2025-05-07T20:31:42.5048694Z compiled: bool, 2025-05-07T20:31:42.5048921Z ) -> None: 2025-05-07T20:31:42.5049147Z torch.manual_seed(2025) 2025-05-07T20:31:42.5049401Z 2025-05-07T20:31:42.5049673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.5050029Z 2025-05-07T20:31:42.5050227Z x_sign = torch.sign(x) 2025-05-07T20:31:42.5050528Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.5050834Z x = x_sign * x_clamp 2025-05-07T20:31:42.5051086Z x0 = x[:, :D] 2025-05-07T20:31:42.5051316Z x1 = x[:, D:] 2025-05-07T20:31:42.5051526Z 2025-05-07T20:31:42.5051720Z if contiguous: 2025-05-07T20:31:42.5051961Z x0 = x0.contiguous() 2025-05-07T20:31:42.5052219Z x1 = x1.contiguous() 2025-05-07T20:31:42.5052456Z 2025-05-07T20:31:42.5052664Z if scale_ub is not None: 2025-05-07T20:31:42.5052937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.5053280Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.5053597Z ) 2025-05-07T20:31:42.5053790Z else: 2025-05-07T20:31:42.5054012Z scale_ub_tensor = None 2025-05-07T20:31:42.5054273Z 2025-05-07T20:31:42.5054517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.5054837Z op = silu_mul_quant 2025-05-07T20:31:42.5055093Z if compiled: 2025-05-07T20:31:42.5055349Z op = torch.compile(op) 2025-05-07T20:31:42.5055648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.5055926Z 2025-05-07T20:31:42.5056124Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.5056289Z 2025-05-07T20:31:42.5056390Z moe/activation_test.py:117: 2025-05-07T20:31:42.5056690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.5057031Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.5057309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.5058010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.5058701Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.5059238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.5059984Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.5060650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.5061179Z kernel = self.compile( 2025-05-07T20:31:42.5061716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.5062369Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.5062773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.5063001Z 2025-05-07T20:31:42.5063223Z self = 2025-05-07T20:31:42.5064294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.5065676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55077490>} 2025-05-07T20:31:42.5067007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.5068024Z context = 2025-05-07T20:31:42.5068398Z 2025-05-07T20:31:42.5068644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.5069164Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.5069635Z module_map=module_map) 2025-05-07T20:31:42.5070012Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.5070362Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.5070627Z E ^ 2025-05-07T20:31:42.5071099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.5071541Z 2025-05-07T20:31:42.5071960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.5072479Z 2025-05-07T20:31:42.5072586Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.5073007Z self=, 2025-05-07T20:31:42.5073409Z T=4096, 2025-05-07T20:31:42.5073598Z D=7168, 2025-05-07T20:31:42.5073796Z scale_ub=1200.0, 2025-05-07T20:31:42.5074024Z contiguous=False, 2025-05-07T20:31:42.5074247Z compiled=False, 2025-05-07T20:31:42.5074458Z ) 2025-05-07T20:31:42.5074777Z self = 2025-05-07T20:31:42.5075278Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:42.5075549Z 2025-05-07T20:31:42.5075632Z @given( 2025-05-07T20:31:42.5075860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.5076173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.5076486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.5076811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.5077144Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.5077439Z ) 2025-05-07T20:31:42.5077784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.5078229Z def test_silu_mul_quant( 2025-05-07T20:31:42.5078474Z self, 2025-05-07T20:31:42.5078666Z T: int, 2025-05-07T20:31:42.5078873Z D: int, 2025-05-07T20:31:42.5079099Z scale_ub: Optional[float], 2025-05-07T20:31:42.5079377Z contiguous: bool, 2025-05-07T20:31:42.5079616Z compiled: bool, 2025-05-07T20:31:42.5079847Z ) -> None: 2025-05-07T20:31:42.5080068Z torch.manual_seed(2025) 2025-05-07T20:31:42.5080307Z 2025-05-07T20:31:42.5080587Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.5080936Z 2025-05-07T20:31:42.5081129Z x_sign = torch.sign(x) 2025-05-07T20:31:42.5081423Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.5081736Z x = x_sign * x_clamp 2025-05-07T20:31:42.5081982Z x0 = x[:, :D] 2025-05-07T20:31:42.5082202Z x1 = x[:, D:] 2025-05-07T20:31:42.5082414Z 2025-05-07T20:31:42.5082604Z if contiguous: 2025-05-07T20:31:42.5082844Z x0 = x0.contiguous() 2025-05-07T20:31:42.5083104Z x1 = x1.contiguous() 2025-05-07T20:31:42.5083340Z 2025-05-07T20:31:42.5083542Z if scale_ub is not None: 2025-05-07T20:31:42.5083820Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.5084151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.5084460Z ) 2025-05-07T20:31:42.5084659Z else: 2025-05-07T20:31:42.5084881Z scale_ub_tensor = None 2025-05-07T20:31:42.5085126Z 2025-05-07T20:31:42.5085363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.5085684Z op = silu_mul_quant 2025-05-07T20:31:42.5085933Z if compiled: 2025-05-07T20:31:42.5086183Z op = torch.compile(op) 2025-05-07T20:31:42.5086578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.5086844Z 2025-05-07T20:31:42.5087046Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.5087330Z 2025-05-07T20:31:42.5087438Z moe/activation_test.py:117: 2025-05-07T20:31:42.5087734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.5088066Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.5088351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.5089034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:42.5089717Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.5090527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.5091205Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.5091860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.5092402Z kernel = self.compile( 2025-05-07T20:31:42.5092951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.5093606Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.5093995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.5094231Z 2025-05-07T20:31:42.5094439Z self = 2025-05-07T20:31:42.5095506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.5096878Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54500550>} 2025-05-07T20:31:42.5098208Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.5099225Z context = 2025-05-07T20:31:42.5099516Z 2025-05-07T20:31:42.5099687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.5100297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.5100760Z module_map=module_map) 2025-05-07T20:31:42.5101129Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.5101489Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.5101755Z E ^ 2025-05-07T20:31:42.5102216Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.5102679Z 2025-05-07T20:31:42.5103093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.5103608Z 2025-05-07T20:31:42.5103717Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.5104127Z self=, 2025-05-07T20:31:42.5104521Z T=16384, 2025-05-07T20:31:42.5104716Z D=7168, 2025-05-07T20:31:42.5104913Z scale_ub=None, 2025-05-07T20:31:42.5105128Z contiguous=True, 2025-05-07T20:31:42.5105356Z compiled=True, 2025-05-07T20:31:42.5105561Z ) 2025-05-07T20:31:42.7047299Z self = 2025-05-07T20:31:42.7048202Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:42.7048679Z 2025-05-07T20:31:42.7048809Z @given( 2025-05-07T20:31:42.7049196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.7050144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.7050794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.7051383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.7051935Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.7052418Z ) 2025-05-07T20:31:42.7053016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.7053777Z def test_silu_mul_quant( 2025-05-07T20:31:42.7054176Z self, 2025-05-07T20:31:42.7054493Z T: int, 2025-05-07T20:31:42.7054807Z D: int, 2025-05-07T20:31:42.7055163Z scale_ub: Optional[float], 2025-05-07T20:31:42.7055618Z contiguous: bool, 2025-05-07T20:31:42.7056013Z compiled: bool, 2025-05-07T20:31:42.7056378Z ) -> None: 2025-05-07T20:31:42.7056729Z torch.manual_seed(2025) 2025-05-07T20:31:42.7057133Z 2025-05-07T20:31:42.7057586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.7058171Z 2025-05-07T20:31:42.7058499Z x_sign = torch.sign(x) 2025-05-07T20:31:42.7058973Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.7059499Z x = x_sign * x_clamp 2025-05-07T20:31:42.7059999Z x0 = x[:, :D] 2025-05-07T20:31:42.7060349Z x1 = x[:, D:] 2025-05-07T20:31:42.7060697Z 2025-05-07T20:31:42.7061000Z if contiguous: 2025-05-07T20:31:42.7061372Z x0 = x0.contiguous() 2025-05-07T20:31:42.7061807Z x1 = x1.contiguous() 2025-05-07T20:31:42.7062211Z 2025-05-07T20:31:42.7062515Z if scale_ub is not None: 2025-05-07T20:31:42.7062973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.7063529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.7064046Z ) 2025-05-07T20:31:42.7064353Z else: 2025-05-07T20:31:42.7064694Z scale_ub_tensor = None 2025-05-07T20:31:42.7065158Z 2025-05-07T20:31:42.7065565Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.7066106Z op = silu_mul_quant 2025-05-07T20:31:42.7066531Z if compiled: 2025-05-07T20:31:42.7066930Z op = torch.compile(op) 2025-05-07T20:31:42.7067431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.7067896Z 2025-05-07T20:31:42.7068203Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.7079519Z 2025-05-07T20:31:42.7079703Z moe/activation_test.py:117: 2025-05-07T20:31:42.7080226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.7080801Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.7081288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.7082285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.7083288Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.7084486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.7085742Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.7086696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.7087912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.7089096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.7090396Z kernel = self.compile( 2025-05-07T20:31:42.7091366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.7092448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.7093144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.7093782Z 2025-05-07T20:31:42.7094153Z self = 2025-05-07T20:31:42.7096311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.7098958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54501360>} 2025-05-07T20:31:42.7101499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.7103352Z context = 2025-05-07T20:31:42.7103869Z 2025-05-07T20:31:42.7104152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.7105097Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.7105919Z module_map=module_map) 2025-05-07T20:31:42.7106548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.7107159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.7107600Z E ^ 2025-05-07T20:31:42.7108409Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.7109238Z 2025-05-07T20:31:42.7109987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.7110923Z 2025-05-07T20:31:42.7111104Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.7111759Z self=, 2025-05-07T20:31:42.7112285Z T=4096, 2025-05-07T20:31:42.7112540Z D=5120, 2025-05-07T20:31:42.7112798Z scale_ub=None, 2025-05-07T20:31:42.7113091Z contiguous=False, 2025-05-07T20:31:42.7113406Z compiled=True, 2025-05-07T20:31:42.7113698Z ) 2025-05-07T20:31:42.7114105Z self = 2025-05-07T20:31:42.7114772Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:42.7115149Z 2025-05-07T20:31:42.7115267Z @given( 2025-05-07T20:31:42.7115573Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.7115998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.7116435Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.7116923Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.7117400Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.7117814Z ) 2025-05-07T20:31:42.7118298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.7118944Z def test_silu_mul_quant( 2025-05-07T20:31:42.7119306Z self, 2025-05-07T20:31:42.7119606Z T: int, 2025-05-07T20:31:42.7119894Z D: int, 2025-05-07T20:31:42.7120222Z scale_ub: Optional[float], 2025-05-07T20:31:42.7120611Z contiguous: bool, 2025-05-07T20:31:42.7120965Z compiled: bool, 2025-05-07T20:31:42.7121302Z ) -> None: 2025-05-07T20:31:42.7121601Z torch.manual_seed(2025) 2025-05-07T20:31:42.7121981Z 2025-05-07T20:31:42.7122358Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.7122869Z 2025-05-07T20:31:42.7123172Z x_sign = torch.sign(x) 2025-05-07T20:31:42.7123638Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.7124146Z x = x_sign * x_clamp 2025-05-07T20:31:42.7124538Z x0 = x[:, :D] 2025-05-07T20:31:42.7124870Z x1 = x[:, D:] 2025-05-07T20:31:42.7125198Z 2025-05-07T20:31:42.7125646Z if contiguous: 2025-05-07T20:31:42.7126003Z x0 = x0.contiguous() 2025-05-07T20:31:42.7126417Z x1 = x1.contiguous() 2025-05-07T20:31:42.7126895Z 2025-05-07T20:31:42.7127194Z if scale_ub is not None: 2025-05-07T20:31:42.7127623Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.7128162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.7128658Z ) 2025-05-07T20:31:42.7128955Z else: 2025-05-07T20:31:42.7129272Z scale_ub_tensor = None 2025-05-07T20:31:42.7129670Z 2025-05-07T20:31:42.7130029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.7130525Z op = silu_mul_quant 2025-05-07T20:31:42.7130921Z if compiled: 2025-05-07T20:31:42.7131314Z op = torch.compile(op) 2025-05-07T20:31:42.7131784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.7132220Z 2025-05-07T20:31:42.7132499Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.7132772Z 2025-05-07T20:31:42.7132923Z moe/activation_test.py:117: 2025-05-07T20:31:42.7133390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.7133908Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.7134362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.7135267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.7136198Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.7137335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.7138510Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.7139449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.7140699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.7141845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.7142754Z kernel = self.compile( 2025-05-07T20:31:42.7143690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.7144843Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.7145492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.7145881Z 2025-05-07T20:31:42.7146207Z self = 2025-05-07T20:31:42.7148138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.7150640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54501ea0>} 2025-05-07T20:31:42.7153027Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.7154824Z context = 2025-05-07T20:31:42.7155352Z 2025-05-07T20:31:42.7155664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.7156511Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.7157276Z module_map=module_map) 2025-05-07T20:31:42.7157874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.7158446Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.7158865Z E ^ 2025-05-07T20:31:42.7159794Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.7160573Z 2025-05-07T20:31:42.7161402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.7162316Z 2025-05-07T20:31:43.0856321Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0857153Z self=, 2025-05-07T20:31:43.0857849Z T=4096, 2025-05-07T20:31:43.0858160Z D=5120, 2025-05-07T20:31:43.0858471Z scale_ub=1200.0, 2025-05-07T20:31:43.0858804Z contiguous=False, 2025-05-07T20:31:43.0859155Z compiled=False, 2025-05-07T20:31:43.0859469Z ) 2025-05-07T20:31:43.0860055Z self = 2025-05-07T20:31:43.0860909Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.0861387Z 2025-05-07T20:31:43.0861553Z @given( 2025-05-07T20:31:43.0861924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.0862466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.0862985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.0863544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.0864095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.0864574Z ) 2025-05-07T20:31:43.0865171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.0865931Z def test_silu_mul_quant( 2025-05-07T20:31:43.0866333Z self, 2025-05-07T20:31:43.0866652Z T: int, 2025-05-07T20:31:43.0866967Z D: int, 2025-05-07T20:31:43.0867323Z scale_ub: Optional[float], 2025-05-07T20:31:43.0867772Z contiguous: bool, 2025-05-07T20:31:43.0868160Z compiled: bool, 2025-05-07T20:31:43.0868529Z ) -> None: 2025-05-07T20:31:43.0868885Z torch.manual_seed(2025) 2025-05-07T20:31:43.0869286Z 2025-05-07T20:31:43.0869732Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.0870320Z 2025-05-07T20:31:43.0870623Z x_sign = torch.sign(x) 2025-05-07T20:31:43.0871104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.0871626Z x = x_sign * x_clamp 2025-05-07T20:31:43.0872025Z x0 = x[:, :D] 2025-05-07T20:31:43.0872370Z x1 = x[:, D:] 2025-05-07T20:31:43.0872710Z 2025-05-07T20:31:43.0873011Z if contiguous: 2025-05-07T20:31:43.0873380Z x0 = x0.contiguous() 2025-05-07T20:31:43.0873807Z x1 = x1.contiguous() 2025-05-07T20:31:43.0874204Z 2025-05-07T20:31:43.0874509Z if scale_ub is not None: 2025-05-07T20:31:43.0874963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.0875522Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.0876028Z ) 2025-05-07T20:31:43.0876349Z else: 2025-05-07T20:31:43.0876695Z scale_ub_tensor = None 2025-05-07T20:31:43.0877110Z 2025-05-07T20:31:43.0877496Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.0878024Z op = silu_mul_quant 2025-05-07T20:31:43.0878430Z if compiled: 2025-05-07T20:31:43.0878836Z op = torch.compile(op) 2025-05-07T20:31:43.0879326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0879786Z 2025-05-07T20:31:43.0880093Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.0880383Z 2025-05-07T20:31:43.0880543Z moe/activation_test.py:117: 2025-05-07T20:31:43.0881035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0881565Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.0882034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0883207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:43.0884820Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.0886045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0887222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0888406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0889345Z kernel = self.compile( 2025-05-07T20:31:43.0890625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0891802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0892490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0892890Z 2025-05-07T20:31:43.0893238Z self = 2025-05-07T20:31:43.0895209Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0897721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54502680>} 2025-05-07T20:31:43.0900265Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0902014Z context = 2025-05-07T20:31:43.0902526Z 2025-05-07T20:31:43.0902802Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0903713Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.0904536Z module_map=module_map) 2025-05-07T20:31:43.0905150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.0905745Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.0906181Z E ^ 2025-05-07T20:31:43.0906982Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.0909032Z 2025-05-07T20:31:43.0909777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.0910708Z 2025-05-07T20:31:43.0910877Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0911590Z self=, 2025-05-07T20:31:43.0912269Z T=4096, 2025-05-07T20:31:43.0912572Z D=5120, 2025-05-07T20:31:43.0912885Z scale_ub=1200.0, 2025-05-07T20:31:43.0913250Z contiguous=False, 2025-05-07T20:31:43.0913619Z compiled=True, 2025-05-07T20:31:43.0913957Z ) 2025-05-07T20:31:43.0914490Z self = 2025-05-07T20:31:43.0915365Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.0915882Z 2025-05-07T20:31:43.0916005Z @given( 2025-05-07T20:31:43.0916376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.0916896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.0917409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.0917963Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.0918514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.0918999Z ) 2025-05-07T20:31:43.0919596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.0920353Z def test_silu_mul_quant( 2025-05-07T20:31:43.0920755Z self, 2025-05-07T20:31:43.0921291Z T: int, 2025-05-07T20:31:43.0921604Z D: int, 2025-05-07T20:31:43.0921961Z scale_ub: Optional[float], 2025-05-07T20:31:43.0922537Z contiguous: bool, 2025-05-07T20:31:43.0922902Z compiled: bool, 2025-05-07T20:31:43.0923183Z ) -> None: 2025-05-07T20:31:43.0923461Z torch.manual_seed(2025) 2025-05-07T20:31:43.0923783Z 2025-05-07T20:31:43.0924148Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.0924627Z 2025-05-07T20:31:43.0924876Z x_sign = torch.sign(x) 2025-05-07T20:31:43.0925256Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.0925670Z x = x_sign * x_clamp 2025-05-07T20:31:43.0925999Z x0 = x[:, :D] 2025-05-07T20:31:43.0926281Z x1 = x[:, D:] 2025-05-07T20:31:43.0926564Z 2025-05-07T20:31:43.0926810Z if contiguous: 2025-05-07T20:31:43.0927115Z x0 = x0.contiguous() 2025-05-07T20:31:43.0927495Z x1 = x1.contiguous() 2025-05-07T20:31:43.0927858Z 2025-05-07T20:31:43.0928134Z if scale_ub is not None: 2025-05-07T20:31:43.0928534Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.0929010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.0929435Z ) 2025-05-07T20:31:43.0929703Z else: 2025-05-07T20:31:43.0930010Z scale_ub_tensor = None 2025-05-07T20:31:43.0930390Z 2025-05-07T20:31:43.0930717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.0931178Z op = silu_mul_quant 2025-05-07T20:31:43.0931543Z if compiled: 2025-05-07T20:31:43.0931896Z op = torch.compile(op) 2025-05-07T20:31:43.0932344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0932721Z 2025-05-07T20:31:43.0933018Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.0933241Z 2025-05-07T20:31:43.0933388Z moe/activation_test.py:117: 2025-05-07T20:31:43.0933839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0934377Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.0934835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0935760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.0936703Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.0937817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.0938937Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.0939780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0941008Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0942108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0943005Z kernel = self.compile( 2025-05-07T20:31:43.0943896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0945036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0945756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0946149Z 2025-05-07T20:31:43.0946455Z self = 2025-05-07T20:31:43.0948304Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0950762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54503ac0>} 2025-05-07T20:31:43.0953360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0955234Z context = 2025-05-07T20:31:43.0955722Z 2025-05-07T20:31:43.0956003Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0956910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.0957694Z module_map=module_map) 2025-05-07T20:31:43.0958288Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.0958874Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.0959306Z E ^ 2025-05-07T20:31:43.0960105Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.0960908Z 2025-05-07T20:31:43.0961653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.0962504Z 2025-05-07T20:31:43.2235847Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.2236666Z self=, 2025-05-07T20:31:43.2237371Z T=2048, 2025-05-07T20:31:43.2237688Z D=7168, 2025-05-07T20:31:43.2238004Z scale_ub=1200.0, 2025-05-07T20:31:43.2238372Z contiguous=False, 2025-05-07T20:31:43.2238723Z compiled=False, 2025-05-07T20:31:43.2239050Z ) 2025-05-07T20:31:43.2239514Z self = 2025-05-07T20:31:43.2240355Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.2240837Z 2025-05-07T20:31:43.2240971Z @given( 2025-05-07T20:31:43.2241347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.2241910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.2242429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.2242993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.2243553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.2244041Z ) 2025-05-07T20:31:43.2244641Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.2245403Z def test_silu_mul_quant( 2025-05-07T20:31:43.2245808Z self, 2025-05-07T20:31:43.2246132Z T: int, 2025-05-07T20:31:43.2246444Z D: int, 2025-05-07T20:31:43.2246804Z scale_ub: Optional[float], 2025-05-07T20:31:43.2247263Z contiguous: bool, 2025-05-07T20:31:43.2247652Z compiled: bool, 2025-05-07T20:31:43.2248022Z ) -> None: 2025-05-07T20:31:43.2248377Z torch.manual_seed(2025) 2025-05-07T20:31:43.2248773Z 2025-05-07T20:31:43.2249221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.2249812Z 2025-05-07T20:31:43.2250123Z x_sign = torch.sign(x) 2025-05-07T20:31:43.2250617Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.2251140Z x = x_sign * x_clamp 2025-05-07T20:31:43.2251528Z x0 = x[:, :D] 2025-05-07T20:31:43.2251883Z x1 = x[:, D:] 2025-05-07T20:31:43.2252222Z 2025-05-07T20:31:43.2252517Z if contiguous: 2025-05-07T20:31:43.2252903Z x0 = x0.contiguous() 2025-05-07T20:31:43.2253338Z x1 = x1.contiguous() 2025-05-07T20:31:43.2253739Z 2025-05-07T20:31:43.2254042Z if scale_ub is not None: 2025-05-07T20:31:43.2254500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.2255066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.2255576Z ) 2025-05-07T20:31:43.2255890Z else: 2025-05-07T20:31:43.2256238Z scale_ub_tensor = None 2025-05-07T20:31:43.2256657Z 2025-05-07T20:31:43.2257459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.2257993Z op = silu_mul_quant 2025-05-07T20:31:43.2258582Z if compiled: 2025-05-07T20:31:43.2259003Z op = torch.compile(op) 2025-05-07T20:31:43.2259498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2260058Z 2025-05-07T20:31:43.2260373Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.2260653Z 2025-05-07T20:31:43.2260823Z moe/activation_test.py:117: 2025-05-07T20:31:43.2261301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2261834Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.2262299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2263473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:43.2264656Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2265583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2266755Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2267928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2268867Z kernel = self.compile( 2025-05-07T20:31:43.2269814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2270982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2271659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2272062Z 2025-05-07T20:31:43.2272406Z self = 2025-05-07T20:31:43.2274347Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2276930Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55826200>} 2025-05-07T20:31:43.2279372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2281108Z context = 2025-05-07T20:31:43.2281620Z 2025-05-07T20:31:43.2281899Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2282810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2283625Z module_map=module_map) 2025-05-07T20:31:43.2284237Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2284842Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2285278Z E ^ 2025-05-07T20:31:43.2286081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2286903Z 2025-05-07T20:31:43.2287643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2288573Z
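Every example in this excerpt fails the same way, so the root cause is worth stating once: Triton lowers the fp8e4nv (e4m3) dtype only on NVIDIA GPUs with compute capability 8.9 or newer, and on older architectures kernel compilation aborts with exactly the ValueError above, before the kernel ever launches. A minimal guard, as a sketch using only public torch APIs (the helper name and skip message are illustrative, not part of this test suite):

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (e4m3) type needs an NVIDIA GPU
    # with compute capability >= 8.9 (Ada/Hopper); older architectures only
    # expose fp8e4b15/fp8e5 and raise the CompilationError seen in this log.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

# Example use: skip fp8 tests up front instead of failing at Triton compile time.
@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8ActivationTests(unittest.TestCase):
    pass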
2025-05-07T20:31:43.2288744Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:43.2350629Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:43.5034699Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:31:43.5086257Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:43.6139470Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
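Any one of these combinations can be pinned for quick reproduction with Hypothesis's @example decorator, which replays a fixed input ahead of the randomly drawn ones. A sketch reusing the strategies from the test above (the function body is elided):

from hypothesis import example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
@settings(deadline=None)
def check_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
    ...  # body as in test_silu_mul_quant above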
2025-05-07T20:31:44.0021316Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:31:44.0052779Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:31:44.1972605Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:44.2005477Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:44.3069133Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
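For context on what is being exercised: from the test body, silu_mul_quant evidently computes SiLU(x0) * x1 and quantizes the result to fp8, returning the quantized tensor and a scale. A plain-PyTorch sketch of those semantics, with assumptions flagged (rowwise scaling, the 448.0 e4m3 range, and the helper name are guesses for illustration, not FBGEMM's actual kernel):

import torch

FP8_E4M3_MAX = 448.0  # assumed: max finite value of torch.float8_e4m3fn

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # Sketch: fused SiLU-multiply, then rowwise fp8 quantization.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        # scale_ub is a 1-element float32 tensor, as in the test above.
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale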
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.3110453Z 2025-05-07T20:31:44.3110977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.3111487Z 2025-05-07T20:31:44.5003525Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5004003Z self=, 2025-05-07T20:31:44.5004415Z T=4096, 2025-05-07T20:31:44.5004598Z D=5120, 2025-05-07T20:31:44.5004794Z scale_ub=1200.0, 2025-05-07T20:31:44.5005022Z contiguous=True, 2025-05-07T20:31:44.5005244Z compiled=True, 2025-05-07T20:31:44.5005452Z ) 2025-05-07T20:31:44.5005774Z self = 2025-05-07T20:31:44.5006267Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.5006541Z 2025-05-07T20:31:44.5006619Z @given( 2025-05-07T20:31:44.5006884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5007190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5007512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5007846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5008176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5008455Z ) 2025-05-07T20:31:44.5008820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5009261Z def test_silu_mul_quant( 2025-05-07T20:31:44.5009580Z self, 2025-05-07T20:31:44.5009838Z T: int, 2025-05-07T20:31:44.5010110Z D: int, 2025-05-07T20:31:44.5010340Z scale_ub: Optional[float], 2025-05-07T20:31:44.5010633Z contiguous: bool, 2025-05-07T20:31:44.5010934Z compiled: bool, 2025-05-07T20:31:44.5011166Z ) -> None: 2025-05-07T20:31:44.5011395Z torch.manual_seed(2025) 2025-05-07T20:31:44.5011655Z 2025-05-07T20:31:44.5011942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5012503Z 2025-05-07T20:31:44.5012717Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5013020Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5013348Z x = x_sign * x_clamp 2025-05-07T20:31:44.5013601Z x0 = x[:, :D] 2025-05-07T20:31:44.5013829Z x1 = x[:, D:] 2025-05-07T20:31:44.5014050Z 2025-05-07T20:31:44.5014249Z if contiguous: 2025-05-07T20:31:44.5014487Z x0 = x0.contiguous() 2025-05-07T20:31:44.5014763Z x1 = x1.contiguous() 2025-05-07T20:31:44.5015015Z 2025-05-07T20:31:44.5015215Z if scale_ub is not None: 2025-05-07T20:31:44.5015502Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5015850Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5016167Z ) 2025-05-07T20:31:44.5016374Z else: 2025-05-07T20:31:44.5016612Z scale_ub_tensor = None 2025-05-07T20:31:44.5016873Z 2025-05-07T20:31:44.5017123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5017447Z op = silu_mul_quant 2025-05-07T20:31:44.5017706Z if compiled: 2025-05-07T20:31:44.5017969Z op = torch.compile(op) 2025-05-07T20:31:44.5018282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5018570Z 2025-05-07T20:31:44.5018769Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5018946Z 2025-05-07T20:31:44.5019053Z moe/activation_test.py:117: 2025-05-07T20:31:44.5019363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5019706Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5020107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5020693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.5021258Z return fn(*args, **kwargs) 
The next seven Hypothesis examples fail with the identical test source, traceback, and CompilationError; only the drawn parameters differ:

2025-05-07T20:31:44.5037184Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:44.6216071Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:44.9741303Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:44.9782308Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:45.1726720Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:45.1759552Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError (fp8e4nv not supported)
2025-05-07T20:31:45.2823302Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError (fp8e4nv not supported)

The compiled=True runs enter the kernel through torch/_dynamo/eval_frame.py:678 while the compiled=False runs call silu_mul_quant directly, but both reach the same _fbgemm_silu_mul_quant[grid] launch and fail at triton/compiler/compiler.py:100, so torch.compile is not a factor.
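Since @given draws these parameter combinations pseudo-randomly, replaying one failure locally is easier with hypothesis.example, which pins a concrete combination so it always runs first. A standalone sketch under that assumption; the strategies mirror the test's and the body is a stand-in for the real kernel call:

from hypothesis import example, given, settings
import hypothesis.strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=4096, D=5120)  # a failing combination taken from the log above
@settings(max_examples=10, deadline=None)
def check_shapes(T: int, D: int) -> None:
    assert T * D > 0  # stand-in for the real op(x0, x1, scale_ub_tensor) call

check_shapes()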
2025-05-07T20:31:45.3634765Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

(same test source as above, now failing earlier, before the kernel is reached:)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:31:45.3657405Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 32.44 MiB free
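The "Tried to allocate" sizes are not arbitrary: each is exactly one [T, 2*D] bfloat16 tensor (2 bytes per element), i.e. the temporary that torch.randn, torch.sign, torch.abs, or torch.clamp materializes at the failing line. A quick arithmetic check against the sizes reported above and below:

# One [T, 2*D] bf16 tensor costs T * 2*D * 2 bytes; compare with the log.
for T, D in [(16384, 5120), (4096, 7168), (16384, 7168), (2048, 7168)]:
    mib = T * (2 * D) * 2 / 2**20
    print(f"T={T:5d}, D={D}: {mib:6.2f} MiB")
# Prints 320.00, 112.00, 448.00 and 56.00 MiB, matching the allocation
# sizes in this run's OutOfMemoryError messages (56 MiB appears twice).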
2025-05-07T20:31:45.3680299Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 144.44 MiB free
2025-05-07T20:31:45.3702161Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free
2025-05-07T20:31:45.3724314Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB with 32.44 MiB free
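Only tens of MiB are free out of 22.07 GiB when these small 56-448 MiB requests fail, so memory from the earlier 16384-row examples is evidently still held by the process. Beyond the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint printed in the error itself, returning cached allocator blocks to the driver between Hypothesis examples is the other obvious lever; a sketch, where the hook placement is an assumption rather than existing test code:

import gc
import os

# The error message's own suggestion; must be set before CUDA is first
# initialized to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def free_cuda_between_examples() -> None:
    # Drop dead Python references, then release cached allocator blocks so
    # the next example starts from a clean slate (e.g. call from tearDown).
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()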
2025-05-07T20:31:45.5008968Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported), traceback identical to the first full example above
2025-05-07T20:31:45.5060938Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> enters the same _fbgemm_silu_mul_quant[grid] compile path; the captured output ends mid-traceback at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5870867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5872070Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5873245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5874203Z kernel = self.compile( 2025-05-07T20:31:45.5875767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5876917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5877621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5878001Z 2025-05-07T20:31:45.5878354Z self = 2025-05-07T20:31:45.5880242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5882664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa54670>} 2025-05-07T20:31:45.5885072Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5886906Z context = 2025-05-07T20:31:45.5887398Z 2025-05-07T20:31:45.5887673Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5888556Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5889355Z module_map=module_map) 2025-05-07T20:31:45.5890183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5890779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5891203Z E ^ 2025-05-07T20:31:45.5891991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5892787Z 2025-05-07T20:31:45.5893522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5894433Z 2025-05-07T20:31:45.5894602Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5895305Z self=, 2025-05-07T20:31:45.5895983Z T=128, 2025-05-07T20:31:45.5896279Z D=7168, 2025-05-07T20:31:45.5896579Z scale_ub=None, 2025-05-07T20:31:45.5896930Z contiguous=True, 2025-05-07T20:31:45.5897293Z compiled=False, 2025-05-07T20:31:45.5897617Z ) 2025-05-07T20:31:45.5898141Z self = 2025-05-07T20:31:45.5898978Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.5899440Z 2025-05-07T20:31:45.5899560Z @given( 2025-05-07T20:31:45.5900021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5900547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5901057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5901615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5902168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5902643Z ) 2025-05-07T20:31:45.5903221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5903981Z def test_silu_mul_quant( 2025-05-07T20:31:45.5904369Z self, 2025-05-07T20:31:45.5904672Z T: int, 2025-05-07T20:31:45.5904987Z D: int, 2025-05-07T20:31:45.5905337Z scale_ub: Optional[float], 2025-05-07T20:31:45.5905777Z contiguous: bool, 2025-05-07T20:31:45.5906178Z compiled: bool, 2025-05-07T20:31:45.5906559Z ) -> None: 2025-05-07T20:31:45.5906896Z torch.manual_seed(2025) 2025-05-07T20:31:45.5907314Z 2025-05-07T20:31:45.5907755Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5908551Z 2025-05-07T20:31:45.5908852Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5909326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5909985Z x = x_sign * x_clamp 2025-05-07T20:31:45.5910373Z x0 = x[:, :D] 2025-05-07T20:31:45.5910720Z x1 = x[:, D:] 2025-05-07T20:31:45.5911060Z 2025-05-07T20:31:45.5922335Z if contiguous: 2025-05-07T20:31:45.5922745Z x0 = x0.contiguous() 2025-05-07T20:31:45.5923190Z x1 = x1.contiguous() 2025-05-07T20:31:45.5923591Z 2025-05-07T20:31:45.5923911Z if scale_ub is not None: 2025-05-07T20:31:45.5924376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5924951Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5925473Z ) 2025-05-07T20:31:45.5925797Z else: 2025-05-07T20:31:45.5926146Z scale_ub_tensor = None 2025-05-07T20:31:45.5926557Z 2025-05-07T20:31:45.5926941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5927489Z op = silu_mul_quant 2025-05-07T20:31:45.5927903Z if compiled: 2025-05-07T20:31:45.5928336Z op = torch.compile(op) 2025-05-07T20:31:45.5928821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5929291Z 2025-05-07T20:31:45.5929608Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5929889Z 2025-05-07T20:31:45.5930061Z moe/activation_test.py:117: 2025-05-07T20:31:45.5930552Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5931137Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5931609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5932709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5933915Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5934861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5936087Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5937250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5938192Z kernel = self.compile( 2025-05-07T20:31:45.5939139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5940431Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5941105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5941509Z 2025-05-07T20:31:45.5941856Z self = 2025-05-07T20:31:45.5943775Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5946269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa54ee0>} 2025-05-07T20:31:45.5948695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5950519Z context = 2025-05-07T20:31:45.5951033Z 2025-05-07T20:31:45.5951310Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5952218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5953035Z module_map=module_map) 2025-05-07T20:31:45.5953636Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5954381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5954812Z E ^ 2025-05-07T20:31:45.5955694Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5956475Z 2025-05-07T20:31:45.5957188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5958070Z 2025-05-07T20:31:45.5958246Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5958936Z self=, 2025-05-07T20:31:45.5959604Z T=2048, 2025-05-07T20:31:45.5959911Z D=7168, 2025-05-07T20:31:45.5960227Z scale_ub=1200.0, 2025-05-07T20:31:45.5960588Z contiguous=True, 2025-05-07T20:31:45.5960961Z compiled=False, 2025-05-07T20:31:45.5961305Z ) 2025-05-07T20:31:45.6908687Z self = 2025-05-07T20:31:45.6909591Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6910064Z 2025-05-07T20:31:45.6910192Z @given( 2025-05-07T20:31:45.6910528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6911007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6911501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6912054Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6912619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6913104Z ) 2025-05-07T20:31:45.6913681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6914451Z def test_silu_mul_quant( 2025-05-07T20:31:45.6914873Z self, 2025-05-07T20:31:45.6915194Z T: int, 2025-05-07T20:31:45.6915525Z D: int, 2025-05-07T20:31:45.6915884Z scale_ub: Optional[float], 2025-05-07T20:31:45.6916336Z contiguous: bool, 2025-05-07T20:31:45.6916746Z compiled: bool, 2025-05-07T20:31:45.6917099Z ) -> None: 2025-05-07T20:31:45.6917448Z torch.manual_seed(2025) 2025-05-07T20:31:45.6917848Z 2025-05-07T20:31:45.6918278Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6921980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.6925458Z 2025-05-07T20:31:45.6925660Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.6926049Z 2025-05-07T20:31:45.6926218Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6926941Z self=, 2025-05-07T20:31:45.6927628Z T=1, 2025-05-07T20:31:45.6927925Z D=5120, 2025-05-07T20:31:45.6928234Z scale_ub=1200.0, 2025-05-07T20:31:45.6928585Z contiguous=True, 2025-05-07T20:31:45.6928936Z compiled=False, 2025-05-07T20:31:45.6929268Z ) 2025-05-07T20:31:45.6929784Z self = 2025-05-07T20:31:45.6930611Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.6931065Z 2025-05-07T20:31:45.6931194Z @given( 2025-05-07T20:31:45.6931563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6932096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6932619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6933180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6934170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6934658Z ) 2025-05-07T20:31:45.6935444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6936227Z def test_silu_mul_quant( 2025-05-07T20:31:45.6936639Z self, 2025-05-07T20:31:45.6936956Z T: int, 2025-05-07T20:31:45.6937265Z D: int, 2025-05-07T20:31:45.6937619Z scale_ub: Optional[float], 2025-05-07T20:31:45.6938075Z contiguous: bool, 2025-05-07T20:31:45.6938461Z compiled: bool, 2025-05-07T20:31:45.6938827Z ) -> None: 2025-05-07T20:31:45.6939169Z torch.manual_seed(2025) 2025-05-07T20:31:45.6939553Z 2025-05-07T20:31:45.6940108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6940711Z 2025-05-07T20:31:45.6941013Z x_sign = torch.sign(x) 2025-05-07T20:31:45.6941500Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.6942030Z x = x_sign * x_clamp 2025-05-07T20:31:45.6942423Z x0 = x[:, :D] 2025-05-07T20:31:45.6942759Z x1 = x[:, D:] 2025-05-07T20:31:45.6943090Z 2025-05-07T20:31:45.6943387Z if contiguous: 2025-05-07T20:31:45.6943754Z x0 = x0.contiguous() 2025-05-07T20:31:45.6944176Z x1 = x1.contiguous() 2025-05-07T20:31:45.6944574Z 2025-05-07T20:31:45.6944879Z if scale_ub is not None: 2025-05-07T20:31:45.6945330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.6945879Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.6946428Z ) 2025-05-07T20:31:45.6946745Z else: 2025-05-07T20:31:45.6947082Z scale_ub_tensor = None 2025-05-07T20:31:45.6947486Z 2025-05-07T20:31:45.6947866Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.6948397Z op = silu_mul_quant 2025-05-07T20:31:45.6948805Z if compiled: 2025-05-07T20:31:45.6949214Z op = torch.compile(op) 2025-05-07T20:31:45.6949701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6950146Z 2025-05-07T20:31:45.6950448Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.6950727Z 2025-05-07T20:31:45.6950889Z moe/activation_test.py:117: 2025-05-07T20:31:45.6951369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6951917Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.6952380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.6953576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.6954773Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.6955700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.6956858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.6958032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.6958948Z kernel = self.compile( 2025-05-07T20:31:45.6959885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.6961015Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.6961688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.6962083Z 2025-05-07T20:31:45.6962409Z self = 2025-05-07T20:31:45.6964303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.6967040Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa55e10>} 2025-05-07T20:31:45.6969426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.6971229Z context = 2025-05-07T20:31:45.6971734Z 2025-05-07T20:31:45.6972008Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.6972895Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.6973698Z module_map=module_map) 2025-05-07T20:31:45.6974301Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.6974891Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.6975334Z E ^ 2025-05-07T20:31:45.6976116Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.6976978Z 2025-05-07T20:31:45.6977709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.6978630Z 2025-05-07T20:31:45.6978801Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.6979505Z self=, 2025-05-07T20:31:45.6980248Z T=2048, 2025-05-07T20:31:45.6980558Z D=5120, 2025-05-07T20:31:45.6980867Z scale_ub=None, 2025-05-07T20:31:45.6981208Z contiguous=True, 2025-05-07T20:31:45.6981573Z compiled=False, 2025-05-07T20:31:45.6981916Z ) 2025-05-07T20:31:45.6982441Z self = 2025-05-07T20:31:45.6983288Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.6983780Z 2025-05-07T20:31:45.6983904Z @given( 2025-05-07T20:31:45.6984280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.6984806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.6985325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.6985884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.6986434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.6986916Z ) 2025-05-07T20:31:45.6987506Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.6988272Z def test_silu_mul_quant( 2025-05-07T20:31:45.6988675Z self, 2025-05-07T20:31:45.6988991Z T: int, 2025-05-07T20:31:45.6989307Z D: int, 2025-05-07T20:31:45.6989648Z scale_ub: Optional[float], 2025-05-07T20:31:45.6990375Z contiguous: bool, 2025-05-07T20:31:45.6990769Z compiled: bool, 2025-05-07T20:31:45.6991121Z ) -> None: 2025-05-07T20:31:45.6991480Z torch.manual_seed(2025) 2025-05-07T20:31:45.6991880Z 2025-05-07T20:31:45.6992323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.6992910Z 2025-05-07T20:31:45.6993222Z > x_sign = torch.sign(x) 2025-05-07T20:31:45.6996654Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.7000072Z 2025-05-07T20:31:45.7000273Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:45.7000646Z 2025-05-07T20:31:45.7001056Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.7001953Z self=, 2025-05-07T20:31:45.7002659Z T=16384, 2025-05-07T20:31:45.7002965Z D=5120, 2025-05-07T20:31:45.7003280Z scale_ub=None, 2025-05-07T20:31:45.7003632Z contiguous=True, 2025-05-07T20:31:45.7003988Z compiled=False, 2025-05-07T20:31:45.7004323Z ) 2025-05-07T20:31:45.7956074Z self = 2025-05-07T20:31:45.7956999Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.7957460Z 2025-05-07T20:31:45.7957594Z @given( 2025-05-07T20:31:45.7957943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.7958432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.7958931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.7959472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.7960038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.7960523Z ) 2025-05-07T20:31:45.7961117Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.7961870Z def test_silu_mul_quant( 2025-05-07T20:31:45.7962272Z self, 2025-05-07T20:31:45.7962584Z T: int, 2025-05-07T20:31:45.7962890Z D: int, 2025-05-07T20:31:45.7963239Z scale_ub: Optional[float], 2025-05-07T20:31:45.7963684Z contiguous: bool, 2025-05-07T20:31:45.7964070Z compiled: bool, 2025-05-07T20:31:45.7964431Z ) -> None: 2025-05-07T20:31:45.7964770Z torch.manual_seed(2025) 2025-05-07T20:31:45.7965156Z 2025-05-07T20:31:45.7965597Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.7969308Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.7972772Z 2025-05-07T20:31:45.7972976Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.7973341Z 2025-05-07T20:31:45.7973515Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.7974224Z self=, 2025-05-07T20:31:45.7974902Z T=4096, 2025-05-07T20:31:45.7975205Z D=5120, 2025-05-07T20:31:45.7975502Z scale_ub=None, 2025-05-07T20:31:45.7975852Z contiguous=True, 2025-05-07T20:31:45.7976208Z compiled=False, 2025-05-07T20:31:45.7976534Z ) 2025-05-07T20:31:45.7977060Z self = 2025-05-07T20:31:45.7977917Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.7978384Z 2025-05-07T20:31:45.7978514Z @given( 2025-05-07T20:31:45.7978886Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.7979415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.7980061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.7980621Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.7981177Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.7981663Z ) 2025-05-07T20:31:45.7982259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.7983009Z def test_silu_mul_quant( 2025-05-07T20:31:45.7983414Z self, 2025-05-07T20:31:45.7983723Z T: int, 2025-05-07T20:31:45.7984050Z D: int, 2025-05-07T20:31:45.7984404Z scale_ub: Optional[float], 2025-05-07T20:31:45.7985176Z contiguous: bool, 2025-05-07T20:31:45.7985572Z compiled: bool, 2025-05-07T20:31:45.7985934Z ) -> None: 2025-05-07T20:31:45.7986497Z torch.manual_seed(2025) 2025-05-07T20:31:45.7986885Z 2025-05-07T20:31:45.7987346Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.7991259Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.7994623Z 2025-05-07T20:31:45.7994837Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.7995210Z 2025-05-07T20:31:45.7995382Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.7996088Z self=, 2025-05-07T20:31:45.7996761Z T=2048, 2025-05-07T20:31:45.7997059Z D=5120, 2025-05-07T20:31:45.7997366Z scale_ub=None, 2025-05-07T20:31:45.7997709Z contiguous=False, 2025-05-07T20:31:45.7998073Z compiled=False, 2025-05-07T20:31:45.7998407Z ) 2025-05-07T20:31:45.7998934Z self = 2025-05-07T20:31:45.7999775Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.8000239Z 2025-05-07T20:31:45.8000369Z @given( 2025-05-07T20:31:45.8000737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8001254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8001764Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8002331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8002875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8003345Z ) 2025-05-07T20:31:45.8003933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8004685Z def test_silu_mul_quant( 2025-05-07T20:31:45.8005083Z self, 2025-05-07T20:31:45.8005398Z T: int, 2025-05-07T20:31:45.8005700Z D: int, 2025-05-07T20:31:45.8006052Z scale_ub: Optional[float], 2025-05-07T20:31:45.8006495Z contiguous: bool, 2025-05-07T20:31:45.8006880Z compiled: bool, 2025-05-07T20:31:45.8007240Z ) -> None: 2025-05-07T20:31:45.8007583Z torch.manual_seed(2025) 2025-05-07T20:31:45.8007974Z 2025-05-07T20:31:45.8008421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8012067Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.8015452Z 2025-05-07T20:31:45.8015653Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.8016015Z 2025-05-07T20:31:45.8016189Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8016875Z self=, 2025-05-07T20:31:45.8017562Z T=4096, 2025-05-07T20:31:45.8017865Z D=7168, 2025-05-07T20:31:45.8018162Z scale_ub=None, 2025-05-07T20:31:45.8018512Z contiguous=True, 2025-05-07T20:31:45.8018870Z compiled=True, 2025-05-07T20:31:45.8019410Z ) 2025-05-07T20:31:45.8020061Z self = 2025-05-07T20:31:45.8021070Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.8021546Z 2025-05-07T20:31:45.8021675Z @given( 2025-05-07T20:31:45.8022037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8022562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8023059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8023604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8024161Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8024637Z ) 2025-05-07T20:31:45.8025221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8025991Z def test_silu_mul_quant( 2025-05-07T20:31:45.8026443Z self, 2025-05-07T20:31:45.8026752Z T: int, 2025-05-07T20:31:45.8027077Z D: int, 2025-05-07T20:31:45.8027431Z scale_ub: Optional[float], 2025-05-07T20:31:45.8027881Z contiguous: bool, 2025-05-07T20:31:45.8028277Z compiled: bool, 2025-05-07T20:31:45.8028642Z ) -> None: 2025-05-07T20:31:45.8028996Z torch.manual_seed(2025) 2025-05-07T20:31:45.8029384Z 2025-05-07T20:31:45.8029829Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8033547Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.8036950Z 2025-05-07T20:31:45.8037161Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.8037521Z 2025-05-07T20:31:45.8037706Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8038398Z self=, 2025-05-07T20:31:45.8039087Z T=2048, 2025-05-07T20:31:45.8039389Z D=5120, 2025-05-07T20:31:45.8039685Z scale_ub=1200.0, 2025-05-07T20:31:45.8040054Z contiguous=False, 2025-05-07T20:31:45.8040425Z compiled=False, 2025-05-07T20:31:45.8040752Z ) 2025-05-07T20:31:45.8041280Z self = 2025-05-07T20:31:45.8042112Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.8042590Z 2025-05-07T20:31:45.8042714Z @given( 2025-05-07T20:31:45.8043094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8043622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8044147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8044701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8045265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8045750Z ) 2025-05-07T20:31:45.8046335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8047099Z def test_silu_mul_quant( 2025-05-07T20:31:45.8047501Z self, 2025-05-07T20:31:45.8047814Z T: int, 2025-05-07T20:31:45.8048135Z D: int, 2025-05-07T20:31:45.8048498Z scale_ub: Optional[float], 2025-05-07T20:31:45.8048942Z contiguous: bool, 2025-05-07T20:31:45.8049334Z compiled: bool, 2025-05-07T20:31:45.8049703Z ) -> None: 2025-05-07T20:31:45.8050045Z torch.manual_seed(2025) 2025-05-07T20:31:45.8050445Z 2025-05-07T20:31:45.8050888Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8054728Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.8058218Z 2025-05-07T20:31:45.8058426Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.8058790Z 2025-05-07T20:31:45.8058958Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8059666Z self=, 2025-05-07T20:31:45.8060476Z T=4096, 2025-05-07T20:31:45.8060777Z D=7168, 2025-05-07T20:31:45.8061093Z scale_ub=1200.0, 2025-05-07T20:31:45.8061468Z contiguous=True, 2025-05-07T20:31:45.8061830Z compiled=False, 2025-05-07T20:31:45.8062157Z ) 2025-05-07T20:31:45.9328048Z self = 2025-05-07T20:31:45.9328956Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.9329428Z 2025-05-07T20:31:45.9329560Z @given( 2025-05-07T20:31:45.9329931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9330432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9330932Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9331475Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9331960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9332400Z ) 2025-05-07T20:31:45.9332947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9333657Z def test_silu_mul_quant( 2025-05-07T20:31:45.9334061Z self, 2025-05-07T20:31:45.9334379Z T: int, 2025-05-07T20:31:45.9334686Z D: int, 2025-05-07T20:31:45.9335044Z scale_ub: Optional[float], 2025-05-07T20:31:45.9335495Z contiguous: bool, 2025-05-07T20:31:45.9335893Z compiled: bool, 2025-05-07T20:31:45.9336262Z ) -> None: 2025-05-07T20:31:45.9336624Z torch.manual_seed(2025) 2025-05-07T20:31:45.9337042Z 2025-05-07T20:31:45.9337487Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9353978Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9357546Z 2025-05-07T20:31:45.9357763Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9358148Z 2025-05-07T20:31:45.9358325Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9359056Z self=, 2025-05-07T20:31:45.9359704Z T=16384, 2025-05-07T20:31:45.9359999Z D=7168, 2025-05-07T20:31:45.9360290Z scale_ub=None, 2025-05-07T20:31:45.9360641Z contiguous=False, 2025-05-07T20:31:45.9361019Z compiled=True, 2025-05-07T20:31:45.9361372Z ) 2025-05-07T20:31:45.9361910Z self = 2025-05-07T20:31:45.9362788Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.9363270Z 2025-05-07T20:31:45.9363412Z @given( 2025-05-07T20:31:45.9363791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9364776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9365303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9366090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9366665Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9367167Z ) 2025-05-07T20:31:45.9367775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9368558Z def test_silu_mul_quant( 2025-05-07T20:31:45.9368973Z self, 2025-05-07T20:31:45.9369305Z T: int, 2025-05-07T20:31:45.9369625Z D: int, 2025-05-07T20:31:45.9369995Z scale_ub: Optional[float], 2025-05-07T20:31:45.9370463Z contiguous: bool, 2025-05-07T20:31:45.9370864Z compiled: bool, 2025-05-07T20:31:45.9371245Z ) -> None: 2025-05-07T20:31:45.9371617Z torch.manual_seed(2025) 2025-05-07T20:31:45.9372020Z 2025-05-07T20:31:45.9372485Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9376254Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9379712Z 2025-05-07T20:31:45.9379995Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9380350Z 2025-05-07T20:31:45.9380522Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9381168Z self=, 2025-05-07T20:31:45.9381848Z T=4096, 2025-05-07T20:31:45.9382172Z D=7168, 2025-05-07T20:31:45.9382481Z scale_ub=None, 2025-05-07T20:31:45.9382846Z contiguous=True, 2025-05-07T20:31:45.9383237Z compiled=False, 2025-05-07T20:31:45.9383584Z ) 2025-05-07T20:31:45.9384135Z self = 2025-05-07T20:31:45.9384988Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.9385466Z 2025-05-07T20:31:45.9385604Z @given( 2025-05-07T20:31:45.9385981Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9386516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9387035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9387598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9388172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9388672Z ) 2025-05-07T20:31:45.9389268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9390403Z def test_silu_mul_quant( 2025-05-07T20:31:45.9390803Z self, 2025-05-07T20:31:45.9391114Z T: int, 2025-05-07T20:31:45.9391446Z D: int, 2025-05-07T20:31:45.9391811Z scale_ub: Optional[float], 2025-05-07T20:31:45.9392269Z contiguous: bool, 2025-05-07T20:31:45.9392667Z compiled: bool, 2025-05-07T20:31:45.9393048Z ) -> None: 2025-05-07T20:31:45.9393409Z torch.manual_seed(2025) 2025-05-07T20:31:45.9393812Z 2025-05-07T20:31:45.9394272Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9398182Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9401732Z 2025-05-07T20:31:45.9401946Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9402301Z 2025-05-07T20:31:45.9402482Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9403194Z self=, 2025-05-07T20:31:45.9403914Z T=16384, 2025-05-07T20:31:45.9404241Z D=7168, 2025-05-07T20:31:45.9404557Z scale_ub=None, 2025-05-07T20:31:45.9404925Z contiguous=True, 2025-05-07T20:31:45.9405298Z compiled=False, 2025-05-07T20:31:45.9405642Z ) 2025-05-07T20:31:45.9406186Z self = 2025-05-07T20:31:45.9407094Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.9407590Z 2025-05-07T20:31:45.9407720Z @given( 2025-05-07T20:31:45.9408116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9408646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9409200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9409776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9410350Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9410833Z ) 2025-05-07T20:31:45.9411437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9412208Z def test_silu_mul_quant( 2025-05-07T20:31:45.9412609Z self, 2025-05-07T20:31:45.9412938Z T: int, 2025-05-07T20:31:45.9413265Z D: int, 2025-05-07T20:31:45.9413625Z scale_ub: Optional[float], 2025-05-07T20:31:45.9414089Z contiguous: bool, 2025-05-07T20:31:45.9414497Z compiled: bool, 2025-05-07T20:31:45.9414866Z ) -> None: 2025-05-07T20:31:45.9415232Z torch.manual_seed(2025) 2025-05-07T20:31:45.9415654Z 2025-05-07T20:31:45.9416105Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9419906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9423331Z 2025-05-07T20:31:45.9423539Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9423893Z 2025-05-07T20:31:45.9424067Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9424779Z self=, 2025-05-07T20:31:45.9425492Z T=16384, 2025-05-07T20:31:45.9425813Z D=7168, 2025-05-07T20:31:45.9426132Z scale_ub=1200.0, 2025-05-07T20:31:45.9426504Z contiguous=True, 2025-05-07T20:31:45.9426877Z compiled=False, 2025-05-07T20:31:45.9427228Z ) 2025-05-07T20:31:45.9427763Z self = 2025-05-07T20:31:45.9428625Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.9429124Z 2025-05-07T20:31:45.9429255Z @given( 2025-05-07T20:31:45.9429643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9430167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9430700Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9431279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9431841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9432343Z ) 2025-05-07T20:31:45.9432958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9433882Z def test_silu_mul_quant( 2025-05-07T20:31:45.9434420Z self, 2025-05-07T20:31:45.9434753Z T: int, 2025-05-07T20:31:45.9435073Z D: int, 2025-05-07T20:31:45.9435445Z scale_ub: Optional[float], 2025-05-07T20:31:45.9435908Z contiguous: bool, 2025-05-07T20:31:45.9436322Z compiled: bool, 2025-05-07T20:31:45.9436687Z ) -> None: 2025-05-07T20:31:45.9437049Z torch.manual_seed(2025) 2025-05-07T20:31:45.9437462Z 2025-05-07T20:31:45.9437909Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9441605Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.9444964Z 2025-05-07T20:31:45.9445175Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.9445547Z 2025-05-07T20:31:45.9445733Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.9446446Z self=, 2025-05-07T20:31:45.9447149Z T=128, 2025-05-07T20:31:45.9447468Z D=5120, 2025-05-07T20:31:45.9447786Z scale_ub=1200.0, 2025-05-07T20:31:45.9448152Z contiguous=False, 2025-05-07T20:31:45.9448545Z compiled=False, 2025-05-07T20:31:45.9448889Z ) 2025-05-07T20:31:46.3055193Z self = 2025-05-07T20:31:46.3056088Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.3056586Z 2025-05-07T20:31:46.3056719Z @given( 2025-05-07T20:31:46.3057102Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3057620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3058129Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3058664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3059209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3059695Z ) 2025-05-07T20:31:46.3060459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3061116Z def test_silu_mul_quant( 2025-05-07T20:31:46.3061458Z self, 2025-05-07T20:31:46.3061726Z T: int, 2025-05-07T20:31:46.3061992Z D: int, 2025-05-07T20:31:46.3062294Z scale_ub: Optional[float], 2025-05-07T20:31:46.3062681Z contiguous: bool, 2025-05-07T20:31:46.3063013Z compiled: bool, 2025-05-07T20:31:46.3063334Z ) -> None: 2025-05-07T20:31:46.3063657Z torch.manual_seed(2025) 2025-05-07T20:31:46.3064007Z 2025-05-07T20:31:46.3064418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3064950Z 2025-05-07T20:31:46.3065238Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3065655Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3066106Z x = x_sign * x_clamp 2025-05-07T20:31:46.3066486Z x0 = x[:, :D] 2025-05-07T20:31:46.3066827Z x1 = x[:, D:] 2025-05-07T20:31:46.3067133Z 2025-05-07T20:31:46.3067421Z if contiguous: 2025-05-07T20:31:46.3067778Z x0 = x0.contiguous() 2025-05-07T20:31:46.3068177Z x1 = x1.contiguous() 2025-05-07T20:31:46.3068565Z 2025-05-07T20:31:46.3068881Z if scale_ub is not None: 2025-05-07T20:31:46.3069354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.3069932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.3070914Z ) 2025-05-07T20:31:46.3071238Z else: 2025-05-07T20:31:46.3071594Z scale_ub_tensor = None 2025-05-07T20:31:46.3072244Z 2025-05-07T20:31:46.3072646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.3073195Z op = silu_mul_quant 2025-05-07T20:31:46.3073610Z if compiled: 2025-05-07T20:31:46.3074028Z op = torch.compile(op) 2025-05-07T20:31:46.3074532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3074999Z 2025-05-07T20:31:46.3075322Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.3075603Z 2025-05-07T20:31:46.3075779Z moe/activation_test.py:117: 2025-05-07T20:31:46.3076279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3076846Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.3077290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3078457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.3079606Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.3080537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.3081706Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.3082835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.3083774Z kernel = self.compile( 2025-05-07T20:31:46.3084717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.3085847Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3086523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3086927Z 2025-05-07T20:31:46.3087277Z self = 2025-05-07T20:31:46.3089234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.3092123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54125cf0>} 2025-05-07T20:31:46.3094502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.3096314Z context = 2025-05-07T20:31:46.3096830Z 2025-05-07T20:31:46.3097113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.3098029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3098852Z module_map=module_map) 2025-05-07T20:31:46.3099468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3100153Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3100598Z E ^ 2025-05-07T20:31:46.3101413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.3102207Z 2025-05-07T20:31:46.3102932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.3103807Z 2025-05-07T20:31:46.3103967Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3104614Z self=, 2025-05-07T20:31:46.3105307Z T=2048, 2025-05-07T20:31:46.3105606Z D=7168, 2025-05-07T20:31:46.3106174Z scale_ub=None, 2025-05-07T20:31:46.3106541Z contiguous=False, 2025-05-07T20:31:46.3106885Z compiled=False, 2025-05-07T20:31:46.3107206Z ) 2025-05-07T20:31:46.3107882Z self = 2025-05-07T20:31:46.3108709Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.3109165Z 2025-05-07T20:31:46.3109297Z @given( 2025-05-07T20:31:46.3109651Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3110161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3110664Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3111210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3111749Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3112227Z ) 2025-05-07T20:31:46.3112810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3113556Z def test_silu_mul_quant( 2025-05-07T20:31:46.3113977Z self, 2025-05-07T20:31:46.3114281Z T: int, 2025-05-07T20:31:46.3114587Z D: int, 2025-05-07T20:31:46.3114950Z scale_ub: Optional[float], 2025-05-07T20:31:46.3115380Z contiguous: bool, 2025-05-07T20:31:46.3115768Z compiled: bool, 2025-05-07T20:31:46.3116133Z ) -> None: 2025-05-07T20:31:46.3116483Z torch.manual_seed(2025) 2025-05-07T20:31:46.3116871Z 2025-05-07T20:31:46.3117325Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3121007Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.3124367Z 2025-05-07T20:31:46.3124568Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.3124931Z 2025-05-07T20:31:46.3125107Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3125797Z self=, 2025-05-07T20:31:46.3126489Z T=128, 2025-05-07T20:31:46.3126791Z D=7168, 2025-05-07T20:31:46.3127100Z scale_ub=1200.0, 2025-05-07T20:31:46.3127456Z contiguous=True, 2025-05-07T20:31:46.3127807Z compiled=True, 2025-05-07T20:31:46.3128133Z ) 2025-05-07T20:31:46.3544228Z self = 2025-05-07T20:31:46.3545117Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.3545561Z 2025-05-07T20:31:46.3545686Z @given( 2025-05-07T20:31:46.3546038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3546494Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3546934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3547431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3547964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3548420Z ) 2025-05-07T20:31:46.3549019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3549785Z def test_silu_mul_quant( 2025-05-07T20:31:46.3550185Z self, 2025-05-07T20:31:46.3550469Z T: int, 2025-05-07T20:31:46.3550765Z D: int, 2025-05-07T20:31:46.3551093Z scale_ub: Optional[float], 2025-05-07T20:31:46.3551509Z contiguous: bool, 2025-05-07T20:31:46.3551866Z compiled: bool, 2025-05-07T20:31:46.3552205Z ) -> None: 2025-05-07T20:31:46.3552535Z torch.manual_seed(2025) 2025-05-07T20:31:46.3552935Z 2025-05-07T20:31:46.3553661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3554147Z 2025-05-07T20:31:46.3554562Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3554987Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3555428Z x = x_sign * x_clamp 2025-05-07T20:31:46.3555771Z x0 = x[:, :D] 2025-05-07T20:31:46.3556084Z x1 = x[:, D:] 2025-05-07T20:31:46.3556378Z 2025-05-07T20:31:46.3556653Z if contiguous: 2025-05-07T20:31:46.3557001Z x0 = x0.contiguous() 2025-05-07T20:31:46.3557387Z x1 = x1.contiguous() 2025-05-07T20:31:46.3557744Z 2025-05-07T20:31:46.3558021Z if scale_ub is not None: 2025-05-07T20:31:46.3558415Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.3558905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.3559345Z ) 2025-05-07T20:31:46.3559621Z else: 2025-05-07T20:31:46.3559915Z scale_ub_tensor = None 2025-05-07T20:31:46.3560285Z 2025-05-07T20:31:46.3560612Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.3561062Z op = silu_mul_quant 2025-05-07T20:31:46.3561423Z if compiled: 2025-05-07T20:31:46.3561774Z op = torch.compile(op) 2025-05-07T20:31:46.3562188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3562598Z 2025-05-07T20:31:46.3562865Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.3563099Z 2025-05-07T20:31:46.3563236Z moe/activation_test.py:117: 2025-05-07T20:31:46.3563659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3564138Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.3564542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.3565359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.3566195Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.3567196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.3568237Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.3569036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.3570062Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.3571065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.3571857Z kernel = self.compile( 2025-05-07T20:31:46.3572665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.3573659Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.3574232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.3574588Z 2025-05-07T20:31:46.3574881Z self = 2025-05-07T20:31:46.3576587Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.3578757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c541270a0>} 2025-05-07T20:31:46.3581077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.3582747Z context = 2025-05-07T20:31:46.3583206Z 2025-05-07T20:31:46.3583459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.3584537Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.3585281Z module_map=module_map) 2025-05-07T20:31:46.3585823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.3586359Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.3586752Z E ^ 2025-05-07T20:31:46.3587485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.3588242Z 2025-05-07T20:31:46.3588923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.3589776Z 2025-05-07T20:31:46.3590281Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3590924Z self=, 2025-05-07T20:31:46.3591541Z T=128, 2025-05-07T20:31:46.3591824Z D=7168, 2025-05-07T20:31:46.3592105Z scale_ub=1200.0, 2025-05-07T20:31:46.3592421Z contiguous=True, 2025-05-07T20:31:46.3592755Z compiled=False, 2025-05-07T20:31:46.3593057Z ) 2025-05-07T20:31:46.3593527Z self = 2025-05-07T20:31:46.3594293Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.3594721Z 2025-05-07T20:31:46.3594834Z @given( 2025-05-07T20:31:46.3595165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3595625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3596091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3596601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3597099Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3597532Z ) 2025-05-07T20:31:46.3598068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3598766Z def test_silu_mul_quant( 2025-05-07T20:31:46.3599122Z self, 2025-05-07T20:31:46.3599409Z T: int, 2025-05-07T20:31:46.3599687Z D: int, 2025-05-07T20:31:46.3599994Z scale_ub: Optional[float], 2025-05-07T20:31:46.3600382Z contiguous: bool, 2025-05-07T20:31:46.3600721Z compiled: bool, 2025-05-07T20:31:46.3601031Z ) -> None: 2025-05-07T20:31:46.3601338Z torch.manual_seed(2025) 2025-05-07T20:31:46.3601692Z 2025-05-07T20:31:46.3602081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3602596Z 2025-05-07T20:31:46.3602880Z x_sign = torch.sign(x) 2025-05-07T20:31:46.3603316Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.3606401Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
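The allocator hint repeated in these messages, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if it is set before the process makes its first CUDA allocation, so in CI it is usually exported in the job environment rather than inside the test. A minimal sketch, assuming the harness has not already set the variable:

    import os

    # Must be set before the first CUDA allocation in this process;
    # exporting it in the workflow environment is the more robust option.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so CUDA init picks it up

    x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)

This only mitigates fragmentation; it does not reclaim the ~21.7 GiB the process already holds in these traces.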
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.3609205Z 2025-05-07T20:31:46.3609380Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:46.3609696Z 2025-05-07T20:31:46.3609846Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3610446Z self=, 2025-05-07T20:31:46.3611023Z T=128, 2025-05-07T20:31:46.3611293Z D=5120, 2025-05-07T20:31:46.3611560Z scale_ub=1200.0, 2025-05-07T20:31:46.3611863Z contiguous=True, 2025-05-07T20:31:46.3612174Z compiled=True, 2025-05-07T20:31:46.3612458Z ) 2025-05-07T20:31:46.3612917Z self = 2025-05-07T20:31:46.3613948Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.3625549Z 2025-05-07T20:31:46.3625679Z @given( 2025-05-07T20:31:46.3626026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.3626475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.3626927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.3627421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.3627905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.3628315Z ) 2025-05-07T20:31:46.3628826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.3629493Z def test_silu_mul_quant( 2025-05-07T20:31:46.3629836Z self, 2025-05-07T20:31:46.3630115Z T: int, 2025-05-07T20:31:46.3630397Z D: int, 2025-05-07T20:31:46.3630700Z scale_ub: Optional[float], 2025-05-07T20:31:46.3631109Z contiguous: bool, 2025-05-07T20:31:46.3631455Z compiled: bool, 2025-05-07T20:31:46.3631768Z ) -> None: 2025-05-07T20:31:46.3632087Z torch.manual_seed(2025) 2025-05-07T20:31:46.3632442Z 2025-05-07T20:31:46.3632822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.3633325Z 2025-05-07T20:31:46.3633607Z > x_sign = torch.sign(x) 2025-05-07T20:31:46.3636573Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
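By this point the OOMs fire on 20 MiB requests while PyTorch already holds ~21.8 GiB, which points at allocations surviving across Hypothesis examples rather than at any one trial. A hedged mitigation sketch, assuming the test class can tolerate the extra synchronization, is to release cached blocks between test methods:

    import gc
    import unittest

    import torch

    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            # Drop dangling Python references first so the caching allocator
            # can free their blocks, then return cached segments to the driver.
            gc.collect()
            torch.cuda.empty_cache()

Note that tearDown runs once per test method, not once per Hypothesis example; freeing inside the test body after each example would be the stronger, more intrusive variant.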
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.3639421Z 2025-05-07T20:31:46.3639601Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:46.3639913Z 2025-05-07T20:31:46.3640072Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.3640686Z self=, 2025-05-07T20:31:46.3641282Z T=128, 2025-05-07T20:31:46.3641545Z D=7168, 2025-05-07T20:31:46.3641817Z scale_ub=None, 2025-05-07T20:31:46.3642124Z contiguous=True, 2025-05-07T20:31:46.3642435Z compiled=True, 2025-05-07T20:31:46.3642728Z ) 2025-05-07T20:31:46.6709494Z self = 2025-05-07T20:31:46.6710048Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.6710315Z 2025-05-07T20:31:46.6710400Z @given( 2025-05-07T20:31:46.6710647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6710969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6711308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6711632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6711986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6712286Z ) 2025-05-07T20:31:46.6712636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6713081Z def test_silu_mul_quant( 2025-05-07T20:31:46.6713330Z self, 2025-05-07T20:31:46.6713523Z T: int, 2025-05-07T20:31:46.6713728Z D: int, 2025-05-07T20:31:46.6713960Z scale_ub: Optional[float], 2025-05-07T20:31:46.6714229Z contiguous: bool, 2025-05-07T20:31:46.6714476Z compiled: bool, 2025-05-07T20:31:46.6714709Z ) -> None: 2025-05-07T20:31:46.6714925Z torch.manual_seed(2025) 2025-05-07T20:31:46.6715176Z 2025-05-07T20:31:46.6715456Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6717762Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
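For reference while reading the trial listings below: ref_fn in this test computes SiLU-and-multiply in fp32, y = x0 * sigmoid(x0) * x1, and then calls triton_quantize_fp8_row, whose rowwise contract the test relies on when it dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A rough eager-mode sketch of that contract, not FBGEMM's kernel, with the e4m3 max of 448 and the clamping details as assumptions:

    from typing import Optional, Tuple

    import torch

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Illustrative per-row absmax scaling into float8_e4m3fn.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = fp8_max / row_max
        y_fp8 = (y * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        # Return the reciprocal scale so y ~= y_fp8.float() * y_scale[:, None].
        return y_fp8, row_max.squeeze(1) / fp8_max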
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.6719852Z 2025-05-07T20:31:46.6719974Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.6720189Z 2025-05-07T20:31:46.6778441Z FAILED 2025-05-07T20:31:46.6778581Z 2025-05-07T20:31:46.6778736Z =================================== FAILURES =================================== 2025-05-07T20:31:46.6779279Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:46.6780129Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:46.6781022Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:31:46.6781792Z | yield 2025-05-07T20:31:46.6782394Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:31:46.6783114Z | self._callTestMethod(testMethod) 2025-05-07T20:31:46.6783898Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:31:46.6784655Z | method() 2025-05-07T20:31:46.6785547Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:46.6786553Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6787443Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:46.6788319Z | raise the_error_hypothesis_found 2025-05-07T20:31:46.6788996Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:46.6789669Z +-+---------------- 1 ---------------- 2025-05-07T20:31:46.6790312Z | Traceback (most recent call last): 2025-05-07T20:31:46.6791313Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:46.6792386Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6795276Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
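Hypothesis reports the run's distinct falsifying examples as a single exceptiongroup.ExceptionGroup, shown above with its four sub-exceptions; on Python 3.10 that comes from the exceptiongroup backport rather than the 3.11 builtin. A sketch of programmatically separating the two failure modes in this log, assuming the backport's split() API:

    import torch
    from exceptiongroup import BaseExceptionGroup  # backport for Python < 3.11

    def partition_failures(eg: BaseExceptionGroup):
        # Split into CUDA OOMs and the rest (here, Triton CompilationErrors).
        oom_group, other_group = eg.split(torch.OutOfMemoryError)
        return oom_group, other_group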
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.6798017Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:46.6798469Z | self=, 2025-05-07T20:31:46.6798881Z | T=128, 2025-05-07T20:31:46.6799083Z | D=7168, 2025-05-07T20:31:46.6799303Z | scale_ub=1200.0, 2025-05-07T20:31:46.6799553Z | contiguous=True, 2025-05-07T20:31:46.6799798Z | compiled=False, 2025-05-07T20:31:46.6800032Z | ) 2025-05-07T20:31:46.6800220Z | 2025-05-07T20:31:46.6800743Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:46.6801351Z +---------------- 2 ---------------- 2025-05-07T20:31:46.6801650Z | Traceback (most recent call last): 2025-05-07T20:31:46.6802701Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:46.6803485Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6805522Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.6807523Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:46.6807975Z | self=, 2025-05-07T20:31:46.6808377Z | T=128, 2025-05-07T20:31:46.6808594Z | D=7168, 2025-05-07T20:31:46.6808815Z | scale_ub=None, 2025-05-07T20:31:46.6809071Z | contiguous=True, 2025-05-07T20:31:46.6809317Z | compiled=True, 2025-05-07T20:31:46.6809551Z | ) 2025-05-07T20:31:46.6809741Z | 2025-05-07T20:31:46.6810267Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:46.6810887Z +---------------- 3 ---------------- 2025-05-07T20:31:46.6811186Z | Traceback (most recent call last): 2025-05-07T20:31:46.6811899Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:46.6812678Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6814730Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
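Each falsifying example above comes with a @reproduce_failure decorator that replays exactly that example; Hypothesis intends it as a temporary addition while debugging, and the payload is only valid against the same Hypothesis version (6.131.14 here) and an unchanged strategy stack. A sketch using the first payload from this log; the ellipsis stands for the existing test body:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # unchanged test body from activation_test.py above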
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.6817562Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:46.6818175Z | self=, 2025-05-07T20:31:46.6818734Z | T=128, 2025-05-07T20:31:46.6819017Z | D=5120, 2025-05-07T20:31:46.6819302Z | scale_ub=1200.0, 2025-05-07T20:31:46.6819629Z | contiguous=True, 2025-05-07T20:31:46.6820143Z | compiled=True, 2025-05-07T20:31:46.6820466Z | ) 2025-05-07T20:31:46.6820702Z | 2025-05-07T20:31:46.6821432Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:46.6822281Z +---------------- 4 ---------------- 2025-05-07T20:31:46.6822682Z | Traceback (most recent call last): 2025-05-07T20:31:46.6823665Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:46.6824654Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.6825576Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:46.6826538Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6827685Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:46.6828997Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.6829846Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:46.6830867Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6831899Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:46.6832981Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.6834076Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:46.6834888Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.6835678Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:46.6836365Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.6837013Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:46.6837574Z | fn() 2025-05-07T20:31:46.6838137Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:46.6838763Z | self.fn.run( 2025-05-07T20:31:46.6839292Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:46.6839864Z | kernel = self.compile( 2025-05-07T20:31:46.6840465Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:46.6841177Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6841885Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:46.6842670Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6843195Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6843555Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.6843811Z | ^ 2025-05-07T20:31:46.6844271Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6844836Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:46.6845233Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:46.6845751Z | self=, 2025-05-07T20:31:46.6846192Z | T=1, # or any other generated value 2025-05-07T20:31:46.6846506Z | D=5120, # or any other generated value 2025-05-07T20:31:46.6846841Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:46.6847204Z | contiguous=True, # or any other generated value 2025-05-07T20:31:46.6847570Z | compiled=True, # or any other generated value 2025-05-07T20:31:46.6847869Z | ) 2025-05-07T20:31:46.6848044Z | 2025-05-07T20:31:46.6848682Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:46.6849545Z +------------------------------------ 2025-05-07T20:31:46.6850057Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:46.6850602Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6851354Z self=, 2025-05-07T20:31:46.6851907Z T=1, 2025-05-07T20:31:46.6852294Z D=5120, 2025-05-07T20:31:46.6852567Z scale_ub=None, 2025-05-07T20:31:46.6852860Z contiguous=True, 2025-05-07T20:31:46.6853172Z compiled=True, 2025-05-07T20:31:46.6853471Z ) 2025-05-07T20:31:46.6853905Z self = 2025-05-07T20:31:46.6854574Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.6854944Z 2025-05-07T20:31:46.6855057Z @given( 2025-05-07T20:31:46.6855387Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6855824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6856262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6856731Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6857185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6857603Z ) 2025-05-07T20:31:46.6858098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6858712Z def test_silu_mul_quant( 2025-05-07T20:31:46.6859023Z self, 2025-05-07T20:31:46.6859285Z T: int, 2025-05-07T20:31:46.6859534Z D: int, 2025-05-07T20:31:46.6859826Z scale_ub: Optional[float], 2025-05-07T20:31:46.6860449Z contiguous: bool, 2025-05-07T20:31:46.6860786Z compiled: bool, 2025-05-07T20:31:46.6861101Z ) -> None: 2025-05-07T20:31:46.6861403Z torch.manual_seed(2025) 2025-05-07T20:31:46.6861747Z 2025-05-07T20:31:46.6862125Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6862610Z 2025-05-07T20:31:46.6862894Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6863301Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6863743Z x = x_sign * x_clamp 2025-05-07T20:31:46.6864083Z x0 = x[:, :D] 2025-05-07T20:31:46.6864366Z x1 = x[:, D:] 2025-05-07T20:31:46.6864640Z 2025-05-07T20:31:46.6864897Z if contiguous: 2025-05-07T20:31:46.6865209Z x0 = x0.contiguous() 
2025-05-07T20:31:46.6865576Z x1 = x1.contiguous() 2025-05-07T20:31:46.6865902Z 2025-05-07T20:31:46.6866135Z if scale_ub is not None: 2025-05-07T20:31:46.6866473Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6866930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6867371Z ) 2025-05-07T20:31:46.6867639Z else: 2025-05-07T20:31:46.6867941Z scale_ub_tensor = None 2025-05-07T20:31:46.6868297Z 2025-05-07T20:31:46.6868615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6869063Z op = silu_mul_quant 2025-05-07T20:31:46.6869423Z if compiled: 2025-05-07T20:31:46.6869770Z op = torch.compile(op) 2025-05-07T20:31:46.6870182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6870573Z 2025-05-07T20:31:46.6870844Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.6871258Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.6871671Z 2025-05-07T20:31:46.6872007Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6872477Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.6872894Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.6873332Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.6873836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6874263Z 2025-05-07T20:31:46.6874549Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.6874824Z 2025-05-07T20:31:46.6874967Z moe/activation_test.py:126: 2025-05-07T20:31:46.6875390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6875856Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.6876462Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6877647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.6878691Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.6879445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6880402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6881401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.6882383Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.6883442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.6884494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.6886769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.6887684Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.6888525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.6889239Z fn() 2025-05-07T20:31:46.6890218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.6891047Z self.fn.run( 2025-05-07T20:31:46.6891701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.6892435Z kernel = self.compile( 2025-05-07T20:31:46.6893200Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.6894122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6894671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6894996Z 2025-05-07T20:31:46.6895286Z self = 2025-05-07T20:31:46.6896782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.6898578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7e568550>} 2025-05-07T20:31:46.6900361Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.6901706Z context = 2025-05-07T20:31:46.6902071Z 2025-05-07T20:31:46.6902269Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.6902944Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6903528Z module_map=module_map) 2025-05-07T20:31:46.6904004Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6904458Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.6904802Z E ^ 2025-05-07T20:31:46.6905409Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6906022Z 2025-05-07T20:31:46.6906581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.6907505Z 2025-05-07T20:31:46.6907639Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6908261Z self=, 2025-05-07T20:31:46.6908780Z T=2048, 2025-05-07T20:31:46.6909013Z D=5120, 2025-05-07T20:31:46.6909249Z scale_ub=1200.0, 2025-05-07T20:31:46.6909513Z contiguous=True, 2025-05-07T20:31:46.6909792Z compiled=False, 2025-05-07T20:31:46.6910079Z ) 2025-05-07T20:31:46.6910461Z self = 2025-05-07T20:31:46.6911060Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.6911390Z 2025-05-07T20:31:46.6911491Z @given( 2025-05-07T20:31:46.6911763Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6912138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6912506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6912905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6913309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6913654Z ) 2025-05-07T20:31:46.6914086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6914689Z def test_silu_mul_quant( 2025-05-07T20:31:46.6915032Z self, 2025-05-07T20:31:46.6915306Z T: int, 2025-05-07T20:31:46.6915540Z D: int, 2025-05-07T20:31:46.6915804Z scale_ub: Optional[float], 2025-05-07T20:31:46.6916134Z contiguous: bool, 2025-05-07T20:31:46.6916451Z compiled: bool, 2025-05-07T20:31:46.6916748Z ) -> None: 2025-05-07T20:31:46.6917012Z torch.manual_seed(2025) 2025-05-07T20:31:46.6917302Z 2025-05-07T20:31:46.6917632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6918050Z 2025-05-07T20:31:46.6918276Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6918629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6919014Z x = x_sign * x_clamp 2025-05-07T20:31:46.6919325Z x0 = x[:, :D] 
2025-05-07T20:31:46.6919586Z x1 = x[:, D:] 2025-05-07T20:31:46.6919847Z 2025-05-07T20:31:46.6920074Z if contiguous: 2025-05-07T20:31:46.6920350Z x0 = x0.contiguous() 2025-05-07T20:31:46.6920664Z x1 = x1.contiguous() 2025-05-07T20:31:46.6920957Z 2025-05-07T20:31:46.6921216Z if scale_ub is not None: 2025-05-07T20:31:46.6921599Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6922067Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6922502Z ) 2025-05-07T20:31:46.6943567Z else: 2025-05-07T20:31:46.6943940Z scale_ub_tensor = None 2025-05-07T20:31:46.6944285Z 2025-05-07T20:31:46.6944588Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6945027Z op = silu_mul_quant 2025-05-07T20:31:46.6945392Z if compiled: 2025-05-07T20:31:46.6945775Z op = torch.compile(op) 2025-05-07T20:31:46.6946208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6946615Z 2025-05-07T20:31:46.6946907Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.6947147Z 2025-05-07T20:31:46.6947304Z moe/activation_test.py:117: 2025-05-07T20:31:46.6947722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6948189Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.6948593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6949526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.6950455Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.6951187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6952101Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6953274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.6954022Z kernel = self.compile( 2025-05-07T20:31:46.6954761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.6955654Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.6956196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6956522Z 2025-05-07T20:31:46.6956806Z self = 2025-05-07T20:31:46.6958314Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.6960245Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7f67b250>} 2025-05-07T20:31:46.6962113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.6963549Z context = 2025-05-07T20:31:46.6963963Z 2025-05-07T20:31:46.6964195Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.6964931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.6965583Z module_map=module_map) 2025-05-07T20:31:46.6966099Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.6966583Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.6966962Z E ^ 2025-05-07T20:31:46.6967616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.6968261Z 2025-05-07T20:31:46.6968869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.6969616Z 2025-05-07T20:31:46.6969767Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.6970316Z self=, 2025-05-07T20:31:46.6970888Z T=2048, 2025-05-07T20:31:46.6971158Z D=5120, 2025-05-07T20:31:46.6971428Z scale_ub=1200.0, 2025-05-07T20:31:46.6971747Z contiguous=True, 2025-05-07T20:31:46.6972058Z compiled=True, 2025-05-07T20:31:46.6972350Z ) 2025-05-07T20:31:46.6972799Z self = 2025-05-07T20:31:46.6973486Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.6973861Z 2025-05-07T20:31:46.6973976Z @given( 2025-05-07T20:31:46.6974288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.6974736Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.6975167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.6975624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.6976093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.6976499Z ) 2025-05-07T20:31:46.6976984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.6977607Z def test_silu_mul_quant( 2025-05-07T20:31:46.6977952Z self, 2025-05-07T20:31:46.6978226Z T: int, 2025-05-07T20:31:46.6978494Z D: int, 2025-05-07T20:31:46.6978805Z scale_ub: Optional[float], 2025-05-07T20:31:46.6979191Z contiguous: bool, 2025-05-07T20:31:46.6979526Z compiled: bool, 2025-05-07T20:31:46.6980012Z ) -> None: 2025-05-07T20:31:46.6980435Z torch.manual_seed(2025) 2025-05-07T20:31:46.6980786Z 2025-05-07T20:31:46.6981259Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.6981734Z 2025-05-07T20:31:46.6982000Z x_sign = torch.sign(x) 2025-05-07T20:31:46.6982395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.6982822Z x = x_sign * x_clamp 2025-05-07T20:31:46.6983144Z x0 = x[:, :D] 2025-05-07T20:31:46.6983461Z x1 = x[:, D:] 2025-05-07T20:31:46.6983752Z 2025-05-07T20:31:46.6984010Z if contiguous: 2025-05-07T20:31:46.6984339Z x0 = x0.contiguous() 2025-05-07T20:31:46.6984691Z x1 = x1.contiguous() 2025-05-07T20:31:46.6985020Z 2025-05-07T20:31:46.6985274Z if scale_ub is not None: 2025-05-07T20:31:46.6985651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.6986071Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.6986454Z ) 2025-05-07T20:31:46.6986685Z else: 2025-05-07T20:31:46.6986959Z scale_ub_tensor = None 2025-05-07T20:31:46.6987268Z 2025-05-07T20:31:46.6987555Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6987959Z op = silu_mul_quant 2025-05-07T20:31:46.6988263Z if compiled: 2025-05-07T20:31:46.6988579Z op = torch.compile(op) 2025-05-07T20:31:46.6988951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.6989292Z 2025-05-07T20:31:46.6989552Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.6990210Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.6990606Z 2025-05-07T20:31:46.6990896Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.6991295Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.6991650Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.6992053Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.6992542Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6992967Z 2025-05-07T20:31:46.6993241Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.6993511Z 2025-05-07T20:31:46.6993644Z moe/activation_test.py:126: 2025-05-07T20:31:46.6994042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.6994507Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.6994953Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.6995984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.6996977Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.6997725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.6998660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.6999607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7000599Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7001612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7002635Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7003619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7004514Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7005348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7006057Z fn() 2025-05-07T20:31:46.7006734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7008582Z self.fn.run( 2025-05-07T20:31:46.7009391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7010157Z kernel = self.compile( 2025-05-07T20:31:46.7010916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7011818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7012365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7012690Z 2025-05-07T20:31:46.7012979Z self = 2025-05-07T20:31:46.7014456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7016371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1c7f67a950>} 2025-05-07T20:31:46.7018204Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7019555Z context = 2025-05-07T20:31:46.7020048Z 2025-05-07T20:31:46.7020274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7020969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7021587Z module_map=module_map) 2025-05-07T20:31:46.7022054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7022521Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7022874Z E ^ 2025-05-07T20:31:46.7023528Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7024191Z 2025-05-07T20:31:46.7024789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7025536Z 2025-05-07T20:31:46.7025689Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7026240Z self=, 2025-05-07T20:31:46.7026755Z T=16384, 2025-05-07T20:31:46.7027012Z D=7168, 2025-05-07T20:31:46.7027274Z scale_ub=1200.0, 2025-05-07T20:31:46.7027563Z contiguous=False, 2025-05-07T20:31:46.7027857Z compiled=False, 2025-05-07T20:31:46.7028130Z ) 2025-05-07T20:31:46.7028540Z self = 2025-05-07T20:31:46.7029194Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7029565Z 2025-05-07T20:31:46.7029672Z @given( 2025-05-07T20:31:46.7029972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7030370Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7030767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7031198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7031620Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7031996Z ) 2025-05-07T20:31:46.7032452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7033024Z def test_silu_mul_quant( 2025-05-07T20:31:46.7033341Z self, 2025-05-07T20:31:46.7033593Z T: int, 2025-05-07T20:31:46.7033849Z D: int, 2025-05-07T20:31:46.7034138Z scale_ub: Optional[float], 2025-05-07T20:31:46.7034496Z contiguous: bool, 2025-05-07T20:31:46.7034910Z compiled: bool, 2025-05-07T20:31:46.7035204Z ) -> None: 2025-05-07T20:31:46.7035603Z torch.manual_seed(2025) 2025-05-07T20:31:46.7035925Z 2025-05-07T20:31:46.7036270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7036714Z 2025-05-07T20:31:46.7036969Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7037336Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7037745Z x = x_sign * x_clamp 2025-05-07T20:31:46.7038060Z x0 = x[:, :D] 2025-05-07T20:31:46.7038345Z x1 = x[:, D:] 2025-05-07T20:31:46.7038623Z 2025-05-07T20:31:46.7038875Z if contiguous: 2025-05-07T20:31:46.7039175Z x0 = x0.contiguous() 2025-05-07T20:31:46.7039522Z x1 = x1.contiguous() 2025-05-07T20:31:46.7039851Z 2025-05-07T20:31:46.7040110Z if scale_ub is not None: 2025-05-07T20:31:46.7040486Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7040966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7041389Z ) 2025-05-07T20:31:46.7041664Z else: 2025-05-07T20:31:46.7041955Z scale_ub_tensor = None 2025-05-07T20:31:46.7042304Z 2025-05-07T20:31:46.7042622Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7043060Z op = silu_mul_quant 2025-05-07T20:31:46.7043414Z if compiled: 
2025-05-07T20:31:46.7043755Z op = torch.compile(op) 2025-05-07T20:31:46.7044173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7044519Z 2025-05-07T20:31:46.7044762Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7044989Z 2025-05-07T20:31:46.7045122Z moe/activation_test.py:117: 2025-05-07T20:31:46.7045482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7045906Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7046306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7047300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7048269Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7048959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7049926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7050807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7051453Z kernel = self.compile( 2025-05-07T20:31:46.7052113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7052916Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7053396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7053679Z 2025-05-07T20:31:46.7053924Z self = 2025-05-07T20:31:46.7055250Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7057013Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7e5be4d0>} 2025-05-07T20:31:46.7058859Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7060378Z context = 2025-05-07T20:31:46.7060784Z 2025-05-07T20:31:46.7061125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7061946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7062604Z module_map=module_map) 2025-05-07T20:31:46.7063081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7063511Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7063867Z E ^ 2025-05-07T20:31:46.7064525Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7065167Z 2025-05-07T20:31:46.7065754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7066474Z 2025-05-07T20:31:46.7066620Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7067189Z self=, 2025-05-07T20:31:46.7067747Z T=1, 2025-05-07T20:31:46.7067996Z D=7168, 2025-05-07T20:31:46.7068270Z scale_ub=None, 2025-05-07T20:31:46.7068574Z contiguous=True, 2025-05-07T20:31:46.7068877Z compiled=True, 2025-05-07T20:31:46.7069163Z ) 2025-05-07T20:31:46.7069607Z self = 2025-05-07T20:31:46.7070263Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7070616Z 2025-05-07T20:31:46.7070719Z @given( 2025-05-07T20:31:46.7071027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7071440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7071850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7072294Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7072734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7073130Z ) 2025-05-07T20:31:46.7073610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7074224Z def test_silu_mul_quant( 2025-05-07T20:31:46.7074547Z self, 2025-05-07T20:31:46.7074819Z T: int, 2025-05-07T20:31:46.7075092Z D: int, 2025-05-07T20:31:46.7075385Z scale_ub: Optional[float], 2025-05-07T20:31:46.7075755Z contiguous: bool, 2025-05-07T20:31:46.7076085Z compiled: bool, 2025-05-07T20:31:46.7076379Z ) -> None: 2025-05-07T20:31:46.7076666Z torch.manual_seed(2025) 2025-05-07T20:31:46.7076992Z 2025-05-07T20:31:46.7077342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7077793Z 2025-05-07T20:31:46.7078050Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7078433Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7078846Z x = x_sign * x_clamp 2025-05-07T20:31:46.7079168Z x0 = x[:, :D] 2025-05-07T20:31:46.7079452Z x1 = x[:, D:] 2025-05-07T20:31:46.7079740Z 2025-05-07T20:31:46.7080003Z if contiguous: 2025-05-07T20:31:46.7080332Z x0 = x0.contiguous() 2025-05-07T20:31:46.7080692Z x1 = x1.contiguous() 2025-05-07T20:31:46.7081035Z 2025-05-07T20:31:46.7081308Z if scale_ub is not None: 2025-05-07T20:31:46.7081691Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7082158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7082592Z ) 2025-05-07T20:31:46.7082864Z else: 2025-05-07T20:31:46.7083161Z scale_ub_tensor = None 2025-05-07T20:31:46.7083514Z 2025-05-07T20:31:46.7083833Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7084279Z op = silu_mul_quant 2025-05-07T20:31:46.7084638Z if compiled: 2025-05-07T20:31:46.7084984Z op = torch.compile(op) 2025-05-07T20:31:46.7085403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7085793Z 2025-05-07T20:31:46.7086174Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7086571Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7086989Z 2025-05-07T20:31:46.7087394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7087838Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7088231Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7088651Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7089123Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7089544Z 2025-05-07T20:31:46.7089818Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:46.7090294Z 2025-05-07T20:31:46.7090436Z moe/activation_test.py:126: 2025-05-07T20:31:46.7090829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7091289Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7091736Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7092821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7093857Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7094595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7095530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7096461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7097446Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7098467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7099477Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7100545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7101425Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7102248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7102951Z fn() 2025-05-07T20:31:46.7103677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7104553Z self.fn.run( 2025-05-07T20:31:46.7105233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7105961Z kernel = self.compile( 2025-05-07T20:31:46.7106749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7107658Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7108205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7108532Z 2025-05-07T20:31:46.7108809Z self = 2025-05-07T20:31:46.7110290Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7112208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1cb1f90160>} 2025-05-07T20:31:46.7114061Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7115669Z context = 2025-05-07T20:31:46.7116059Z 2025-05-07T20:31:46.7116427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7117167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7117801Z module_map=module_map) 2025-05-07T20:31:46.7118278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7118751Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7119114Z E ^ 2025-05-07T20:31:46.7119778Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7120445Z 2025-05-07T20:31:46.7121037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7121768Z 2025-05-07T20:31:46.7121909Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7122482Z self=, 2025-05-07T20:31:46.7123023Z T=4096, 2025-05-07T20:31:46.7123289Z D=5120, 2025-05-07T20:31:46.7123551Z scale_ub=None, 2025-05-07T20:31:46.7123838Z contiguous=False, 2025-05-07T20:31:46.7124151Z compiled=False, 2025-05-07T20:31:46.7124435Z ) 2025-05-07T20:31:46.7124857Z self = 2025-05-07T20:31:46.7125541Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.7125917Z 2025-05-07T20:31:46.7126032Z @given( 2025-05-07T20:31:46.7126338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7126769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7127189Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7127639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7128086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7128488Z ) 2025-05-07T20:31:46.7128978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7129571Z def test_silu_mul_quant( 2025-05-07T20:31:46.7129904Z self, 2025-05-07T20:31:46.7130168Z T: int, 2025-05-07T20:31:46.7130430Z D: int, 2025-05-07T20:31:46.7130733Z scale_ub: Optional[float], 2025-05-07T20:31:46.7131104Z contiguous: bool, 2025-05-07T20:31:46.7131432Z compiled: bool, 2025-05-07T20:31:46.7131742Z ) -> None: 2025-05-07T20:31:46.7132044Z torch.manual_seed(2025) 2025-05-07T20:31:46.7132367Z 2025-05-07T20:31:46.7132731Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7133189Z 2025-05-07T20:31:46.7133450Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7133835Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7134253Z x = x_sign * x_clamp 2025-05-07T20:31:46.7134602Z x0 = x[:, :D] 2025-05-07T20:31:46.7134902Z x1 = x[:, D:] 2025-05-07T20:31:46.7135198Z 2025-05-07T20:31:46.7135465Z if contiguous: 2025-05-07T20:31:46.7135787Z x0 = x0.contiguous() 2025-05-07T20:31:46.7136153Z x1 = x1.contiguous() 2025-05-07T20:31:46.7136533Z 2025-05-07T20:31:46.7136818Z if scale_ub is not None: 2025-05-07T20:31:46.7137215Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7137689Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7138123Z ) 2025-05-07T20:31:46.7138394Z else: 2025-05-07T20:31:46.7138692Z scale_ub_tensor = None 2025-05-07T20:31:46.7139046Z 2025-05-07T20:31:46.7139376Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7139928Z op = silu_mul_quant 2025-05-07T20:31:46.7140274Z if compiled: 
2025-05-07T20:31:46.7140616Z             op = torch.compile(op)
2025-05-07T20:31:46.7141145Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7141518Z 
2025-05-07T20:31:46.7141780Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7142098Z 
2025-05-07T20:31:46.7142228Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7142625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:46.7143078Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7143480Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7144443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7145377Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7146095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:46.7147010Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:46.7147900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:46.7148648Z     kernel = self.compile(
2025-05-07T20:31:46.7158916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:46.7159818Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:46.7160370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:46.7160685Z 
2025-05-07T20:31:46.7160980Z self = 
2025-05-07T20:31:46.7162457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:46.7164344Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c7f67ab90>}
2025-05-07T20:31:46.7166199Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:46.7167632Z context = 
2025-05-07T20:31:46.7168029Z 
2025-05-07T20:31:46.7168273Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:46.7169015Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:46.7169688Z                            module_map=module_map)
2025-05-07T20:31:46.7170209Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7170707Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7171066Z E   ^
2025-05-07T20:31:46.7171714Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7172339Z 
2025-05-07T20:31:46.7172928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7173640Z 
2025-05-07T20:31:46.7173782Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.7174345Z     self=,
2025-05-07T20:31:46.7174908Z     T=4096,
2025-05-07T20:31:46.7175174Z     D=7168,
2025-05-07T20:31:46.7175436Z     scale_ub=None,
2025-05-07T20:31:46.7175741Z     contiguous=False,
2025-05-07T20:31:46.7176058Z     compiled=False,
2025-05-07T20:31:46.7176344Z )
2025-05-07T20:31:46.7209763Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7210121Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7210382Z E   ^
2025-05-07T20:31:46.7210848Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7211703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7212225Z 
2025-05-07T20:31:46.7212331Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.7212744Z     self=,
2025-05-07T20:31:46.7213140Z     T=128,
2025-05-07T20:31:46.7213324Z     D=7168,
2025-05-07T20:31:46.7213522Z     scale_ub=None,
2025-05-07T20:31:46.7213741Z     contiguous=False,
2025-05-07T20:31:46.7213967Z     compiled=True,
2025-05-07T20:31:46.7214183Z )
2025-05-07T20:31:46.7249801Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7250155Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:31:46.7250422Z E   ^
2025-05-07T20:31:46.7250877Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7251735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7252247Z 
2025-05-07T20:31:46.7252352Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.7252767Z     self=,
2025-05-07T20:31:46.7253166Z     T=128,
2025-05-07T20:31:46.7253361Z     D=7168,
2025-05-07T20:31:46.7253565Z     scale_ub=None,
2025-05-07T20:31:46.7253779Z     contiguous=False,
2025-05-07T20:31:46.7254012Z     compiled=False,
2025-05-07T20:31:46.7254220Z )
2025-05-07T20:31:46.7280487Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7280844Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7281098Z E   ^
2025-05-07T20:31:46.7281567Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7282430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7283031Z 
2025-05-07T20:31:46.7283143Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.7283624Z     self=,
2025-05-07T20:31:46.7284030Z     T=4096,
2025-05-07T20:31:46.7284227Z     D=5120,
2025-05-07T20:31:46.7284417Z     scale_ub=1200.0,
2025-05-07T20:31:46.7284650Z     contiguous=True,
2025-05-07T20:31:46.7284874Z     compiled=False,
2025-05-07T20:31:46.7285087Z )
2025-05-07T20:31:46.7311823Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7312174Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7312436Z E   ^
2025-05-07T20:31:46.7312897Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7313764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7314273Z 
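Every failure in this run is the same compile-time error: Triton rejects the fp8e4nv (FP8 E4M3) dtype on this runner's GPU. A linux.g5.4xlarge instance carries an NVIDIA A10G (compute capability 8.6), and Triton's NVIDIA backend only accepts fp8e4nv from sm_89 (Ada) onward; on sm_86 it offers only 'fp8e4b15' and 'fp8e5', exactly as the ValueError reports. Cases with compiled=False fail inside fn() when _fbgemm_silu_mul_quant is compiled, while compiled=True cases reach the reference path and fail in _kernel_quantize_fp8_row instead. Below is a minimal sketch of a capability guard such a test could use to skip rather than error; the helper name, the (8, 9) threshold, and the unittest-based skip are illustrative assumptions, not code from this repository or this log.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumed guard: fp8e4nv (E4M3) kernels compile only on Ada (sm_89)
    # or newer GPUs; the A10G on a g5.4xlarge runner reports sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class Fp8GuardExample(unittest.TestCase):
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    def test_fp8_kernel_path(self) -> None:
        # On an sm_86 runner this test would be reported as skipped instead
        # of raising triton.compiler.errors.CompilationError as seen above.
        pass


With a guard like this, a Hypothesis-driven test such as test_silu_mul_quant would surface one skip per unsupported device rather than one CompilationError per drawn example.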
y_scale_ref = ref_fn() 2025-05-07T20:31:46.7331150Z 2025-05-07T20:31:46.7331260Z moe/activation_test.py:126: 2025-05-07T20:31:46.7331398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7331507Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7331651Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7332225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7332331Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7332703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7332928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7333319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7333580Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7333983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7334243Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7334617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7334800Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7335148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7335228Z fn() 2025-05-07T20:31:46.7335733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7335893Z self.fn.run( 2025-05-07T20:31:46.7336237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7336341Z kernel = self.compile( 2025-05-07T20:31:46.7336722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7336908Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7337039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7337044Z 2025-05-07T20:31:46.7337253Z self = 2025-05-07T20:31:46.7338035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7338555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1c7e5bea70>} 2025-05-07T20:31:46.7339308Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7339503Z context = 2025-05-07T20:31:46.7339508Z 2025-05-07T20:31:46.7339677Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7340057Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7340168Z module_map=module_map) 2025-05-07T20:31:46.7340343Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7340453Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7340534Z E ^ 2025-05-07T20:31:46.7340901Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7340906Z 2025-05-07T20:31:46.7341326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7341331Z 2025-05-07T20:31:46.7341443Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7341667Z self=, 2025-05-07T20:31:46.7341747Z T=2048, 2025-05-07T20:31:46.7341831Z D=5120, 2025-05-07T20:31:46.7341918Z scale_ub=None, 2025-05-07T20:31:46.7342005Z contiguous=True, 2025-05-07T20:31:46.7342095Z compiled=True, 2025-05-07T20:31:46.7342174Z ) 2025-05-07T20:31:46.7342391Z self = 2025-05-07T20:31:46.7342574Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7342587Z 2025-05-07T20:31:46.7342667Z @given( 2025-05-07T20:31:46.7342799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7342901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7343018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7343144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7343263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7343343Z ) 2025-05-07T20:31:46.7343599Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7343697Z def test_silu_mul_quant( 2025-05-07T20:31:46.7343777Z self, 2025-05-07T20:31:46.7343869Z T: int, 2025-05-07T20:31:46.7343949Z D: int, 2025-05-07T20:31:46.7344051Z scale_ub: Optional[float], 2025-05-07T20:31:46.7344246Z contiguous: bool, 2025-05-07T20:31:46.7344336Z compiled: bool, 2025-05-07T20:31:46.7344427Z ) -> None: 2025-05-07T20:31:46.7344602Z torch.manual_seed(2025) 2025-05-07T20:31:46.7344682Z 2025-05-07T20:31:46.7344867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7344948Z 2025-05-07T20:31:46.7345047Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7345183Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7345276Z x = x_sign * x_clamp 2025-05-07T20:31:46.7345360Z x0 = x[:, :D] 2025-05-07T20:31:46.7345452Z x1 = x[:, D:] 2025-05-07T20:31:46.7345529Z 2025-05-07T20:31:46.7345616Z if contiguous: 2025-05-07T20:31:46.7345723Z x0 = x0.contiguous() 2025-05-07T20:31:46.7345819Z x1 = x1.contiguous() 2025-05-07T20:31:46.7345903Z 2025-05-07T20:31:46.7345999Z if scale_ub is not None: 2025-05-07T20:31:46.7346108Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7346265Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7346346Z ) 2025-05-07T20:31:46.7346432Z else: 2025-05-07T20:31:46.7346539Z scale_ub_tensor = None 2025-05-07T20:31:46.7346616Z 2025-05-07T20:31:46.7346750Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7346850Z op = silu_mul_quant 2025-05-07T20:31:46.7346940Z if compiled: 
2025-05-07T20:31:46.7347042Z op = torch.compile(op) 2025-05-07T20:31:46.7347158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7347236Z 2025-05-07T20:31:46.7347338Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7347465Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7347540Z 2025-05-07T20:31:46.7347686Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7347790Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7347897Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7348029Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7348172Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7348250Z 2025-05-07T20:31:46.7348353Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.7348358Z 2025-05-07T20:31:46.7348470Z moe/activation_test.py:126: 2025-05-07T20:31:46.7348599Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7348707Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7348849Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7349406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7349517Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7349878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7350114Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7350485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7350742Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7351143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7351394Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7351767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7351939Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7352280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7352448Z fn() 2025-05-07T20:31:46.7352980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7353064Z self.fn.run( 2025-05-07T20:31:46.7353412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7353510Z kernel = self.compile( 2025-05-07T20:31:46.7353894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7354077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7354206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7354210Z 2025-05-07T20:31:46.7354422Z self = 2025-05-07T20:31:46.7355200Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:46.7355712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c6c9f8dc0>} 2025-05-07T20:31:46.7356461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7356654Z context = 2025-05-07T20:31:46.7356659Z 2025-05-07T20:31:46.7356832Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7357099Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7357212Z module_map=module_map) 2025-05-07T20:31:46.7357387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7357492Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7357571Z E ^ 2025-05-07T20:31:46.7357935Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7357940Z 2025-05-07T20:31:46.7358360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7358365Z 2025-05-07T20:31:46.7358477Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7358701Z self=, 2025-05-07T20:31:46.7358782Z T=128, 2025-05-07T20:31:46.7358867Z D=5120, 2025-05-07T20:31:46.7358952Z scale_ub=None, 2025-05-07T20:31:46.7359045Z contiguous=True, 2025-05-07T20:31:46.7359130Z compiled=True, 2025-05-07T20:31:46.7359212Z ) 2025-05-07T20:31:46.7359436Z self = 2025-05-07T20:31:46.7359614Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7359619Z 2025-05-07T20:31:46.7359698Z @given( 2025-05-07T20:31:46.7359826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7359932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7360049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7360174Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7360291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7360375Z ) 2025-05-07T20:31:46.7360629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7360726Z def test_silu_mul_quant( 2025-05-07T20:31:46.7360811Z self, 2025-05-07T20:31:46.7360891Z T: int, 2025-05-07T20:31:46.7361058Z D: int, 2025-05-07T20:31:46.7361166Z scale_ub: Optional[float], 2025-05-07T20:31:46.7361259Z contiguous: bool, 2025-05-07T20:31:46.7361422Z compiled: bool, 2025-05-07T20:31:46.7361511Z ) -> None: 2025-05-07T20:31:46.7361609Z torch.manual_seed(2025) 2025-05-07T20:31:46.7361684Z 2025-05-07T20:31:46.7361858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7361933Z 2025-05-07T20:31:46.7362033Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7362160Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7362251Z x = x_sign * x_clamp 2025-05-07T20:31:46.7362341Z x0 = x[:, :D] 2025-05-07T20:31:46.7362423Z x1 = x[:, D:] 2025-05-07T20:31:46.7362498Z 2025-05-07T20:31:46.7362589Z if contiguous: 2025-05-07T20:31:46.7362684Z x0 = x0.contiguous() 2025-05-07T20:31:46.7362775Z x1 = x1.contiguous() 2025-05-07T20:31:46.7362856Z 2025-05-07T20:31:46.7362956Z if scale_ub is not None: 2025-05-07T20:31:46.7363062Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7363211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7363291Z ) 2025-05-07T20:31:46.7363371Z else: 2025-05-07T20:31:46.7363474Z scale_ub_tensor = None 2025-05-07T20:31:46.7363549Z 2025-05-07T20:31:46.7363687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:46.7363779Z op = silu_mul_quant 2025-05-07T20:31:46.7363867Z if compiled: 2025-05-07T20:31:46.7363976Z op = torch.compile(op) 2025-05-07T20:31:46.7364083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7364158Z 2025-05-07T20:31:46.7364259Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7364383Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7364458Z 2025-05-07T20:31:46.7364603Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7364713Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7364818Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7364948Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7365089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7365171Z 2025-05-07T20:31:46.7365272Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.7365277Z 2025-05-07T20:31:46.7365376Z moe/activation_test.py:126: 2025-05-07T20:31:46.7365512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7365619Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7365756Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7366320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7366424Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7366805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7367032Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7367399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7367660Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7368063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7368321Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7368693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7368860Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7369372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7369455Z fn() 2025-05-07T20:31:46.7369859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7369950Z self.fn.run( 2025-05-07T20:31:46.7370292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7370396Z kernel = self.compile( 2025-05-07T20:31:46.7370783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7370960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7371095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7371100Z 2025-05-07T20:31:46.7371304Z self = 2025-05-07T20:31:46.7372098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7372608Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c564dcb80>} 2025-05-07T20:31:46.7373346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7373542Z context = 2025-05-07T20:31:46.7373547Z 2025-05-07T20:31:46.7373712Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7373990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7374104Z module_map=module_map) 2025-05-07T20:31:46.7374269Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7374378Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7374458Z E ^ 2025-05-07T20:31:46.7374819Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7374824Z 2025-05-07T20:31:46.7375237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7375242Z 2025-05-07T20:31:46.7375347Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7375577Z self=, 2025-05-07T20:31:46.7375657Z T=4096, 2025-05-07T20:31:46.7375735Z D=5120, 2025-05-07T20:31:46.7375834Z scale_ub=None, 2025-05-07T20:31:46.7375922Z contiguous=True, 2025-05-07T20:31:46.7376017Z compiled=True, 2025-05-07T20:31:46.7376093Z ) 2025-05-07T20:31:46.7376315Z self = 2025-05-07T20:31:46.7376494Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7376499Z 2025-05-07T20:31:46.7376577Z @given( 2025-05-07T20:31:46.7376699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7376806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7376923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7377042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7377164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7377241Z ) 2025-05-07T20:31:46.7377497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7377592Z def test_silu_mul_quant( 2025-05-07T20:31:46.7377757Z self, 2025-05-07T20:31:46.7377843Z T: int, 2025-05-07T20:31:46.7377921Z D: int, 2025-05-07T20:31:46.7378102Z scale_ub: Optional[float], 2025-05-07T20:31:46.7378203Z contiguous: bool, 2025-05-07T20:31:46.7378291Z compiled: bool, 2025-05-07T20:31:46.7378372Z ) -> None: 2025-05-07T20:31:46.7378478Z torch.manual_seed(2025) 2025-05-07T20:31:46.7378553Z 2025-05-07T20:31:46.7378723Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7378808Z 2025-05-07T20:31:46.7378901Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7379032Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7379124Z x = x_sign * x_clamp 2025-05-07T20:31:46.7379206Z x0 = x[:, :D] 2025-05-07T20:31:46.7379298Z x1 = x[:, D:] 2025-05-07T20:31:46.7379372Z 2025-05-07T20:31:46.7379458Z if contiguous: 2025-05-07T20:31:46.7379559Z x0 = x0.contiguous() 2025-05-07T20:31:46.7379655Z x1 = x1.contiguous() 2025-05-07T20:31:46.7379729Z 2025-05-07T20:31:46.7379903Z if scale_ub is not None: 2025-05-07T20:31:46.7380011Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7380150Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7380234Z ) 2025-05-07T20:31:46.7380313Z else: 2025-05-07T20:31:46.7380411Z scale_ub_tensor 
= None 2025-05-07T20:31:46.7380491Z 2025-05-07T20:31:46.7380623Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7380723Z op = silu_mul_quant 2025-05-07T20:31:46.7380810Z if compiled: 2025-05-07T20:31:46.7380916Z op = torch.compile(op) 2025-05-07T20:31:46.7381030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7381104Z 2025-05-07T20:31:46.7381198Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7381327Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7381408Z 2025-05-07T20:31:46.7381546Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7381659Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7381761Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7381885Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7382032Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7382108Z 2025-05-07T20:31:46.7382216Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.7382220Z 2025-05-07T20:31:46.7382321Z moe/activation_test.py:126: 2025-05-07T20:31:46.7382450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7382564Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7382698Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7383257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7383373Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7383739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7383967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7384340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7384601Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7385005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7385258Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7385637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7385923Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7386350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7386440Z fn() 2025-05-07T20:31:46.7386839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7386927Z self.fn.run( 2025-05-07T20:31:46.7387270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7387366Z kernel = self.compile( 2025-05-07T20:31:46.7387753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7387929Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7388057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7388068Z 2025-05-07T20:31:46.7388288Z self = 2025-05-07T20:31:46.7389061Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7389566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c5658c0d0>} 2025-05-07T20:31:46.7390683Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7390888Z context = 2025-05-07T20:31:46.7390900Z 2025-05-07T20:31:46.7391069Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7391342Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7391456Z module_map=module_map) 2025-05-07T20:31:46.7391623Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7391728Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:46.7391815Z E ^ 2025-05-07T20:31:46.7392172Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7392177Z 2025-05-07T20:31:46.7392604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7392609Z 2025-05-07T20:31:46.7392715Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7392939Z self=, 2025-05-07T20:31:46.7393032Z T=16384, 2025-05-07T20:31:46.7393111Z D=5120, 2025-05-07T20:31:46.7393197Z scale_ub=None, 2025-05-07T20:31:46.7393295Z contiguous=True, 2025-05-07T20:31:46.7393384Z compiled=True, 2025-05-07T20:31:46.7393461Z ) 2025-05-07T20:31:46.7393687Z self = 2025-05-07T20:31:46.7393862Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7393867Z 2025-05-07T20:31:46.7393954Z @given( 2025-05-07T20:31:46.7394076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7394180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7394303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7394422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7394539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7394625Z ) 2025-05-07T20:31:46.7394872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7395219Z def test_silu_mul_quant( 2025-05-07T20:31:46.7395406Z self, 2025-05-07T20:31:46.7395487Z T: int, 2025-05-07T20:31:46.7395573Z D: int, 2025-05-07T20:31:46.7395674Z scale_ub: Optional[float], 2025-05-07T20:31:46.7395767Z contiguous: bool, 2025-05-07T20:31:46.7395868Z compiled: bool, 2025-05-07T20:31:46.7395948Z ) -> None: 2025-05-07T20:31:46.7396045Z torch.manual_seed(2025) 2025-05-07T20:31:46.7396126Z 2025-05-07T20:31:46.7396299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7396375Z 2025-05-07T20:31:46.7396474Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7396601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7396699Z x = x_sign * x_clamp 2025-05-07T20:31:46.7396783Z x0 = x[:, :D] 2025-05-07T20:31:46.7396864Z x1 = x[:, D:] 2025-05-07T20:31:46.7396949Z 2025-05-07T20:31:46.7397035Z if contiguous: 2025-05-07T20:31:46.7397129Z x0 = x0.contiguous() 2025-05-07T20:31:46.7397232Z x1 = x1.contiguous() 2025-05-07T20:31:46.7397307Z 2025-05-07T20:31:46.7397400Z if scale_ub is not None: 2025-05-07T20:31:46.7397512Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7397651Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:46.7397729Z ) 2025-05-07T20:31:46.7397815Z else: 2025-05-07T20:31:46.7397912Z scale_ub_tensor = None 2025-05-07T20:31:46.7397988Z 2025-05-07T20:31:46.7398125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7398217Z op = silu_mul_quant 2025-05-07T20:31:46.7398311Z if compiled: 2025-05-07T20:31:46.7398413Z op = torch.compile(op) 2025-05-07T20:31:46.7398522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7398603Z 2025-05-07T20:31:46.7398704Z y_fp8, y_scale = fn() 2025-05-07T20:31:46.7398829Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:46.7398913Z 2025-05-07T20:31:46.7399052Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7399156Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:46.7399263Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:46.7399387Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:46.7399534Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7399610Z 2025-05-07T20:31:46.7399711Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:46.7399715Z 2025-05-07T20:31:46.7399819Z moe/activation_test.py:126: 2025-05-07T20:31:46.7399950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7400056Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:46.7400198Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:46.7400767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:46.7400878Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:46.7401244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7401465Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7401843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:46.7402101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7402503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:46.7402762Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:46.7403388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:46.7403564Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:46.7403906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:46.7403985Z fn() 2025-05-07T20:31:46.7404397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:46.7404483Z self.fn.run( 2025-05-07T20:31:46.7404828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7404929Z kernel = self.compile( 2025-05-07T20:31:46.7405315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7405499Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7405639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:46.7405643Z 
2025-05-07T20:31:46.7405854Z self = 
2025-05-07T20:31:46.7406682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:46.7407185Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c5663a3b0>}
2025-05-07T20:31:46.7407933Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:46.7408131Z context = 
2025-05-07T20:31:46.7408136Z 
2025-05-07T20:31:46.7408313Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:46.7408580Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:31:46.7408856Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7408963Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:46.7409042Z E       ^
2025-05-07T20:31:46.7409406Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7409411Z 
2025-05-07T20:31:46.7409824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7409829Z 
2025-05-07T20:31:46.7409940Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:46.7415726Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7415832Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7416079Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7416183Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7416558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:46.7416656Z     return fn(*args, **kwargs)
2025-05-07T20:31:46.7417166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7417268Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7418707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:46.7418885Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:46.7422437Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7422552Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7422631Z E       ^
2025-05-07T20:31:46.7422995Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7423000Z 
2025-05-07T20:31:46.7423420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
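Both tracebacks above fail identically: Triton refuses to emit fp8e4nv (its name for torch.float8_e4m3fn) because the runner GPU, the A10G on linux.g5.4xlarge, has compute capability 8.6, while Triton only lowers fp8e4nv conversions on compute capability 8.9 (Ada) and above; pre-Ada parts get only fp8e4b15 and fp8e5, exactly as the error lists. A minimal guard sketch in that spirit, assuming a hypothetical helper name gpu_supports_fp8e4nv (not part of the test file):

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (Ada) or SM 9.0 (Hopper);
    # SM 8.6 parts such as the A10G only support fp8e4b15 and fp8e5.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, the whole Hypothesis search would be skipped on
# this runner instead of erroring on every drawn example, e.g.:
#
#   @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
#   def test_silu_mul_quant(self, ...) -> None: ...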
2025-05-07T20:31:46.7423424Z 
2025-05-07T20:31:46.7423531Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:46.7430317Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:46.7430431Z moe/activation_test.py:126: 
2025-05-07T20:31:46.7430673Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:46.7430808Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:46.7431367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:46.7431477Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:46.7439309Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7439418Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:46.7439505Z E       ^
2025-05-07T20:31:46.7439859Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7440276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7440287Z 
2025-05-07T20:31:46.7440393Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:46.7446182Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7446287Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7446534Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7446634Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7447152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7447257Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7452157Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7452272Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7452353Z E       ^
2025-05-07T20:31:46.7452713Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7453136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7453141Z 
2025-05-07T20:31:46.7453246Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:46.7464969Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7465092Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7465338Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7465442Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7465816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:46.7465923Z     return fn(*args, **kwargs)
2025-05-07T20:31:46.7466417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7466526Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7471653Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7471760Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7471848Z E       ^
2025-05-07T20:31:46.7472205Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7472627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7472632Z 
2025-05-07T20:31:46.7472744Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:46.7478619Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7478724Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7478970Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7479072Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7479594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7479697Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7484795Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7484898Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7484982Z E       ^
2025-05-07T20:31:46.7485338Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7485756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
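The ref_fn path that fails in _kernel_quantize_fp8_row is row-wise fp8 quantization of y = x0 * sigmoid(x0) * x1. A plain-PyTorch sketch of that rescaling, inferred only from how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]); the scale_ub clamping detail is an assumption, and quantize_fp8_row_ref is a hypothetical name, not the FBGEMM implementation:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # The per-row max magnitude decides how much each row must be rescaled
    # to fit into the representable fp8 range.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # dequant scale per row
    y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale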
2025-05-07T20:31:46.7485768Z 
2025-05-07T20:31:46.7485873Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:46.7491989Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7492321Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7492681Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7492784Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7493296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7493397Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7498276Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7498384Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7498464Z E       ^
2025-05-07T20:31:46.7498822Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7499237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7499246Z 
2025-05-07T20:31:46.7499359Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:46.7505142Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7505251Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7505486Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7505587Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7506087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7506192Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7511202Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7511308Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7511393Z E       ^
2025-05-07T20:31:46.7511746Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7512171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7512176Z 
2025-05-07T20:31:46.7512281Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:46.7518012Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7518118Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7518357Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7518455Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7518831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:46.7518933Z     return fn(*args, **kwargs)
2025-05-07T20:31:46.7519426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7519526Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7524367Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7524471Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7524549Z E       ^
2025-05-07T20:31:46.7524899Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7525321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:46.7525326Z 
2025-05-07T20:31:46.7525515Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:46.7531067Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:46.7531173Z moe/activation_test.py:117: 
2025-05-07T20:31:46.7531400Z moe/activation_test.py:115: in fn
2025-05-07T20:31:46.7531507Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:46.7531870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:46.7531966Z     return fn(*args, **kwargs)
2025-05-07T20:31:46.7532463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:46.7532563Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:46.7537693Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:46.7537795Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:46.7537876Z E       ^
2025-05-07T20:31:46.7538233Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:46.7538649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
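Every _fbgemm_silu_mul_quant failure sits in the fused activation path, whose math the test's ref_fn spells out: y = x0 * sigmoid(x0) * x1 in fp32, i.e. SiLU(x0) * x1, followed by the row-wise quantization sketched earlier. An eager-mode equivalence check of just the activation part, runnable on CPU (silu_mul_ref is a hypothetical name):

import torch
import torch.nn.functional as F

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x) = x * sigmoid(x), so this matches ref_fn's
    # x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32.
    return F.silu(x0.float()) * x1.float()

x = torch.randn([4, 2 * 8], dtype=torch.bfloat16)
x0, x1 = x[:, :8], x[:, 8:]
assert torch.allclose(
    silu_mul_ref(x0, x1),
    x0.float() * torch.sigmoid(x0.float()) * x1.float(),
)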
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f1c55826440>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
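All of these failures share one root cause: Triton's fp8e4nv (e4m3) dtype requires an NVIDIA GPU with compute capability >= 8.9 (Ada or Hopper), and the A10G backing this g5 runner is sm_86, so any kernel that materializes an fp8e4nv value fails to compile before the test logic ever runs. A minimal sketch of a capability guard for such tests, assuming a plain unittest-style test class (the helper and class names here are illustrative, not FBGEMM's actual API):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv to e4m3 hardware conversions, which first
    # appeared on Ada (sm_89) and Hopper (sm_90) GPUs.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89 or newer")
class SiluMulQuantTests(unittest.TestCase):
    ...  # test_silu_mul_quant would live here

With a guard like this the job would report these cases as skipped instead of burning through every Hypothesis example and failing the suite.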
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7569307Z 2025-05-07T20:31:46.7569729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7569740Z 2025-05-07T20:31:46.7569857Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7570080Z self=, 2025-05-07T20:31:46.7570160Z T=1, 2025-05-07T20:31:46.7570247Z D=5120, 2025-05-07T20:31:46.7570333Z scale_ub=1200.0, 2025-05-07T20:31:46.7570428Z contiguous=False, 2025-05-07T20:31:46.7570515Z compiled=False, 2025-05-07T20:31:46.7570593Z ) 2025-05-07T20:31:46.7570818Z self = 2025-05-07T20:31:46.7570989Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7570993Z 2025-05-07T20:31:46.7571074Z @given( 2025-05-07T20:31:46.7571202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7571304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7571426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7571557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7571675Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7571760Z ) 2025-05-07T20:31:46.7572011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7572108Z def test_silu_mul_quant( 2025-05-07T20:31:46.7572194Z self, 2025-05-07T20:31:46.7572273Z T: int, 2025-05-07T20:31:46.7572353Z D: int, 2025-05-07T20:31:46.7572462Z scale_ub: Optional[float], 2025-05-07T20:31:46.7572557Z contiguous: bool, 2025-05-07T20:31:46.7572647Z compiled: bool, 2025-05-07T20:31:46.7572734Z ) -> None: 2025-05-07T20:31:46.7572831Z torch.manual_seed(2025) 2025-05-07T20:31:46.7572907Z 2025-05-07T20:31:46.7573082Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7573166Z 2025-05-07T20:31:46.7573269Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7573395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7573492Z x = x_sign * x_clamp 2025-05-07T20:31:46.7573584Z x0 = x[:, :D] 2025-05-07T20:31:46.7573668Z x1 = x[:, D:] 2025-05-07T20:31:46.7573745Z 2025-05-07T20:31:46.7573837Z if contiguous: 2025-05-07T20:31:46.7573932Z x0 = x0.contiguous() 2025-05-07T20:31:46.7574026Z x1 = x1.contiguous() 2025-05-07T20:31:46.7574110Z 2025-05-07T20:31:46.7574203Z if scale_ub is not None: 2025-05-07T20:31:46.7574310Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7574453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7574531Z ) 2025-05-07T20:31:46.7574610Z else: 2025-05-07T20:31:46.7574718Z scale_ub_tensor = None 2025-05-07T20:31:46.7574795Z 2025-05-07T20:31:46.7574933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7575125Z op = silu_mul_quant 2025-05-07T20:31:46.7575216Z if compiled: 2025-05-07T20:31:46.7575435Z op = torch.compile(op) 2025-05-07T20:31:46.7575549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7575626Z 2025-05-07T20:31:46.7575726Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7575730Z 2025-05-07T20:31:46.7575830Z moe/activation_test.py:117: 2025-05-07T20:31:46.7575962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7576071Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7576172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7576678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7576780Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7577140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7577381Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7577726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7577825Z kernel = self.compile( 2025-05-07T20:31:46.7578214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7578388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7578527Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7578531Z 2025-05-07T20:31:46.7578734Z self = 2025-05-07T20:31:46.7579502Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7580132Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55075e10>} 2025-05-07T20:31:46.7580876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7581075Z context = 2025-05-07T20:31:46.7581079Z 2025-05-07T20:31:46.7581245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7581514Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7581624Z module_map=module_map) 2025-05-07T20:31:46.7581789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7581904Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7581986Z E ^ 2025-05-07T20:31:46.7582344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7582349Z 2025-05-07T20:31:46.7582767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7582772Z 2025-05-07T20:31:46.7582878Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7583110Z self=, 2025-05-07T20:31:46.7583190Z T=16384, 2025-05-07T20:31:46.7583294Z D=5120, 2025-05-07T20:31:46.7583380Z scale_ub=1200.0, 2025-05-07T20:31:46.7583471Z contiguous=False, 2025-05-07T20:31:46.7583566Z compiled=True, 2025-05-07T20:31:46.7583645Z ) 2025-05-07T20:31:46.7589610Z self = 2025-05-07T20:31:46.7590323Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7590572Z 2025-05-07T20:31:46.7590666Z @given( 2025-05-07T20:31:46.7590798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7590910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7591033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7591152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7591279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7591359Z ) 2025-05-07T20:31:46.7591624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7591728Z def test_silu_mul_quant( 2025-05-07T20:31:46.7591815Z self, 2025-05-07T20:31:46.7591906Z T: int, 2025-05-07T20:31:46.7591987Z D: int, 2025-05-07T20:31:46.7592091Z scale_ub: Optional[float], 2025-05-07T20:31:46.7592202Z contiguous: bool, 2025-05-07T20:31:46.7592292Z compiled: bool, 2025-05-07T20:31:46.7592380Z ) -> None: 2025-05-07T20:31:46.7592492Z torch.manual_seed(2025) 2025-05-07T20:31:46.7592572Z 2025-05-07T20:31:46.7592747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7592833Z 2025-05-07T20:31:46.7592931Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7593069Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7593163Z x = x_sign * x_clamp 2025-05-07T20:31:46.7593249Z x0 = x[:, :D] 2025-05-07T20:31:46.7593342Z x1 = x[:, D:] 2025-05-07T20:31:46.7593422Z 2025-05-07T20:31:46.7593513Z if contiguous: 2025-05-07T20:31:46.7593620Z x0 = x0.contiguous() 2025-05-07T20:31:46.7593714Z x1 = x1.contiguous() 2025-05-07T20:31:46.7593792Z 2025-05-07T20:31:46.7593899Z if scale_ub is not None: 2025-05-07T20:31:46.7594008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7594152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7594243Z ) 2025-05-07T20:31:46.7594330Z else: 2025-05-07T20:31:46.7594437Z scale_ub_tensor = None 2025-05-07T20:31:46.7594516Z 2025-05-07T20:31:46.7594648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7594750Z op = silu_mul_quant 2025-05-07T20:31:46.7594842Z if compiled: 2025-05-07T20:31:46.7594946Z op = torch.compile(op) 2025-05-07T20:31:46.7595064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7595142Z 2025-05-07T20:31:46.7595237Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7595242Z 2025-05-07T20:31:46.7595352Z moe/activation_test.py:117: 2025-05-07T20:31:46.7595482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7595594Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7595702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7596079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7596187Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7596683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7596788Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7597154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7597379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7597728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7597828Z kernel = self.compile( 2025-05-07T20:31:46.7598213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7598540Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7598745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7598751Z 2025-05-07T20:31:46.7598962Z self = 2025-05-07T20:31:46.7599745Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7600247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c550743a0>} 2025-05-07T20:31:46.7601001Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7601201Z context = 2025-05-07T20:31:46.7601206Z 2025-05-07T20:31:46.7601384Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7601649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7601762Z module_map=module_map) 2025-05-07T20:31:46.7601934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7602037Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7602118Z E ^ 2025-05-07T20:31:46.7602487Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7602492Z 2025-05-07T20:31:46.7602913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7602921Z 2025-05-07T20:31:46.7603037Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7603265Z self=, 2025-05-07T20:31:46.7603349Z T=2048, 2025-05-07T20:31:46.7603438Z D=7168, 2025-05-07T20:31:46.7603525Z scale_ub=1200.0, 2025-05-07T20:31:46.7603613Z contiguous=False, 2025-05-07T20:31:46.7603710Z compiled=True, 2025-05-07T20:31:46.7603791Z ) 2025-05-07T20:31:46.7604017Z self = 2025-05-07T20:31:46.7604194Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7604198Z 2025-05-07T20:31:46.7604278Z @given( 2025-05-07T20:31:46.7604410Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7604516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7604633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7604764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7604884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7604969Z ) 2025-05-07T20:31:46.7605228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7605326Z def test_silu_mul_quant( 2025-05-07T20:31:46.7605417Z self, 2025-05-07T20:31:46.7605497Z T: int, 2025-05-07T20:31:46.7605578Z D: int, 2025-05-07T20:31:46.7605690Z scale_ub: Optional[float], 2025-05-07T20:31:46.7605786Z contiguous: bool, 2025-05-07T20:31:46.7605876Z compiled: bool, 2025-05-07T20:31:46.7605971Z ) -> None: 2025-05-07T20:31:46.7606071Z torch.manual_seed(2025) 2025-05-07T20:31:46.7606152Z 2025-05-07T20:31:46.7606335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7606414Z 2025-05-07T20:31:46.7606515Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7606676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7606881Z x = x_sign * x_clamp 2025-05-07T20:31:46.7606976Z x0 = x[:, :D] 2025-05-07T20:31:46.7607144Z x1 = x[:, D:] 2025-05-07T20:31:46.7607225Z 2025-05-07T20:31:46.7607323Z if contiguous: 2025-05-07T20:31:46.7607420Z x0 = x0.contiguous() 2025-05-07T20:31:46.7607513Z x1 = x1.contiguous() 2025-05-07T20:31:46.7607601Z 2025-05-07T20:31:46.7607700Z if scale_ub is not None: 2025-05-07T20:31:46.7607812Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7607960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7608040Z ) 2025-05-07T20:31:46.7608122Z else: 2025-05-07T20:31:46.7608234Z scale_ub_tensor = None 2025-05-07T20:31:46.7608312Z 2025-05-07T20:31:46.7608455Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7608550Z op = silu_mul_quant 2025-05-07T20:31:46.7608648Z if compiled: 2025-05-07T20:31:46.7608766Z op = torch.compile(op) 2025-05-07T20:31:46.7608881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7608960Z 2025-05-07T20:31:46.7609064Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7609069Z 2025-05-07T20:31:46.7609173Z moe/activation_test.py:117: 2025-05-07T20:31:46.7609304Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7609417Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7609522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7609898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7609996Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7610492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7610601Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7610969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7611204Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7611555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7611666Z kernel = self.compile( 2025-05-07T20:31:46.7612055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7612242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7612373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7612377Z 2025-05-07T20:31:46.7612584Z self = 2025-05-07T20:31:46.7613368Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7613874Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55075fc0>} 2025-05-07T20:31:46.7614622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7614815Z context = 2025-05-07T20:31:46.7614820Z 2025-05-07T20:31:46.7614987Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7615258Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7615367Z module_map=module_map) 2025-05-07T20:31:46.7615634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7615849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7615931Z E ^ 2025-05-07T20:31:46.7616291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7616296Z 2025-05-07T20:31:46.7616711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7616715Z 2025-05-07T20:31:46.7616826Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7617054Z self=, 2025-05-07T20:31:46.7617134Z T=1, 2025-05-07T20:31:46.7617219Z D=5120, 2025-05-07T20:31:46.7617306Z scale_ub=None, 2025-05-07T20:31:46.7617398Z contiguous=False, 2025-05-07T20:31:46.7617495Z compiled=False, 2025-05-07T20:31:46.7617580Z ) 2025-05-07T20:31:46.7617798Z self = 2025-05-07T20:31:46.7617980Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.7617985Z 2025-05-07T20:31:46.7618065Z @given( 2025-05-07T20:31:46.7618194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7618295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7618414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7618541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7618657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7618737Z ) 2025-05-07T20:31:46.7618993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7619093Z def test_silu_mul_quant( 2025-05-07T20:31:46.7619177Z self, 2025-05-07T20:31:46.7619265Z T: int, 2025-05-07T20:31:46.7619345Z D: int, 2025-05-07T20:31:46.7619454Z scale_ub: Optional[float], 2025-05-07T20:31:46.7619556Z contiguous: bool, 2025-05-07T20:31:46.7619648Z compiled: bool, 2025-05-07T20:31:46.7619743Z ) -> None: 2025-05-07T20:31:46.7619959Z torch.manual_seed(2025) 2025-05-07T20:31:46.7620041Z 2025-05-07T20:31:46.7620220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7620298Z 2025-05-07T20:31:46.7620394Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7620529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7620623Z x = x_sign * x_clamp 2025-05-07T20:31:46.7620708Z x0 = x[:, :D] 2025-05-07T20:31:46.7620802Z x1 = x[:, D:] 2025-05-07T20:31:46.7620879Z 2025-05-07T20:31:46.7620968Z if contiguous: 2025-05-07T20:31:46.7621072Z x0 = x0.contiguous() 2025-05-07T20:31:46.7621164Z x1 = x1.contiguous() 2025-05-07T20:31:46.7621240Z 2025-05-07T20:31:46.7621343Z if scale_ub is not None: 2025-05-07T20:31:46.7621456Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7621608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7621688Z ) 2025-05-07T20:31:46.7621769Z else: 2025-05-07T20:31:46.7621873Z scale_ub_tensor = None 2025-05-07T20:31:46.7621950Z 2025-05-07T20:31:46.7622081Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7622182Z op = silu_mul_quant 2025-05-07T20:31:46.7622271Z if compiled: 2025-05-07T20:31:46.7622378Z op = torch.compile(op) 2025-05-07T20:31:46.7622493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7622571Z 2025-05-07T20:31:46.7622664Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7622675Z 2025-05-07T20:31:46.7622778Z moe/activation_test.py:117: 2025-05-07T20:31:46.7622906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7623109Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7623211Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7623783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7623890Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7624247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7624475Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7624816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7624913Z kernel = self.compile( 2025-05-07T20:31:46.7625309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7625485Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7625618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7625623Z 2025-05-07T20:31:46.7625842Z self = 2025-05-07T20:31:46.7626616Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7627131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c55077490>} 2025-05-07T20:31:46.7627872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7628070Z context = 2025-05-07T20:31:46.7628078Z 2025-05-07T20:31:46.7628250Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7628511Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7628628Z module_map=module_map) 2025-05-07T20:31:46.7628794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7628895Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7628981Z E ^ 2025-05-07T20:31:46.7629335Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7629340Z 2025-05-07T20:31:46.7629759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7629764Z 2025-05-07T20:31:46.7629872Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7630098Z self=, 2025-05-07T20:31:46.7630186Z T=4096, 2025-05-07T20:31:46.7630270Z D=7168, 2025-05-07T20:31:46.7630358Z scale_ub=1200.0, 2025-05-07T20:31:46.7630456Z contiguous=False, 2025-05-07T20:31:46.7630546Z compiled=False, 2025-05-07T20:31:46.7630628Z ) 2025-05-07T20:31:46.7630845Z self = 2025-05-07T20:31:46.7631024Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7631028Z 2025-05-07T20:31:46.7631113Z @given( 2025-05-07T20:31:46.7631233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7631334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7631455Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7631572Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7631693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7631855Z ) 2025-05-07T20:31:46.7632104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7632281Z def test_silu_mul_quant( 2025-05-07T20:31:46.7632363Z self, 2025-05-07T20:31:46.7632442Z T: int, 2025-05-07T20:31:46.7632526Z D: int, 2025-05-07T20:31:46.7632628Z scale_ub: Optional[float], 2025-05-07T20:31:46.7632723Z contiguous: bool, 2025-05-07T20:31:46.7632819Z compiled: bool, 2025-05-07T20:31:46.7632904Z ) -> None: 2025-05-07T20:31:46.7633002Z torch.manual_seed(2025) 2025-05-07T20:31:46.7633085Z 2025-05-07T20:31:46.7633257Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7633333Z 2025-05-07T20:31:46.7633433Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7633558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7633658Z x = x_sign * x_clamp 2025-05-07T20:31:46.7633743Z x0 = x[:, :D] 2025-05-07T20:31:46.7633832Z x1 = x[:, D:] 2025-05-07T20:31:46.7633916Z 2025-05-07T20:31:46.7634004Z if contiguous: 2025-05-07T20:31:46.7634107Z x0 = x0.contiguous() 2025-05-07T20:31:46.7634207Z x1 = x1.contiguous() 2025-05-07T20:31:46.7634287Z 2025-05-07T20:31:46.7634382Z if scale_ub is not None: 2025-05-07T20:31:46.7634495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7634631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7634712Z ) 2025-05-07T20:31:46.7634801Z else: 2025-05-07T20:31:46.7634898Z scale_ub_tensor = None 2025-05-07T20:31:46.7634984Z 2025-05-07T20:31:46.7635116Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7635210Z op = silu_mul_quant 2025-05-07T20:31:46.7635305Z if compiled: 2025-05-07T20:31:46.7635408Z op = torch.compile(op) 2025-05-07T20:31:46.7635516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7635606Z 2025-05-07T20:31:46.7635702Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7635706Z 2025-05-07T20:31:46.7635816Z moe/activation_test.py:117: 2025-05-07T20:31:46.7635953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7636058Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7636167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7636671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:46.7636772Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7637136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7637363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7637703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7637817Z kernel = self.compile( 2025-05-07T20:31:46.7638207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7638390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7638519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7638523Z 2025-05-07T20:31:46.7638728Z self = 2025-05-07T20:31:46.7639507Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7640009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54500550>} 2025-05-07T20:31:46.7640922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7641115Z context = 2025-05-07T20:31:46.7641119Z 2025-05-07T20:31:46.7641293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7641559Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7641668Z module_map=module_map) 2025-05-07T20:31:46.7641839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7641942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7642022Z E ^ 2025-05-07T20:31:46.7642382Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7642395Z 2025-05-07T20:31:46.7642817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7642822Z 2025-05-07T20:31:46.7642935Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7643159Z self=, 2025-05-07T20:31:46.7643240Z T=16384, 2025-05-07T20:31:46.7643326Z D=7168, 2025-05-07T20:31:46.7643413Z scale_ub=None, 2025-05-07T20:31:46.7643502Z contiguous=True, 2025-05-07T20:31:46.7643595Z compiled=True, 2025-05-07T20:31:46.7643673Z ) 2025-05-07T20:31:46.7643893Z self = 2025-05-07T20:31:46.7644076Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.7644080Z 2025-05-07T20:31:46.7644162Z @given( 2025-05-07T20:31:46.7644288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7644397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7644524Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7644649Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7644764Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7644842Z ) 2025-05-07T20:31:46.7645099Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7645196Z def test_silu_mul_quant( 2025-05-07T20:31:46.7645276Z self, 2025-05-07T20:31:46.7645362Z T: int, 2025-05-07T20:31:46.7645441Z D: int, 2025-05-07T20:31:46.7645553Z scale_ub: Optional[float], 2025-05-07T20:31:46.7645646Z contiguous: bool, 2025-05-07T20:31:46.7645735Z compiled: bool, 2025-05-07T20:31:46.7645821Z ) -> None: 2025-05-07T20:31:46.7645919Z torch.manual_seed(2025) 2025-05-07T20:31:46.7645994Z 2025-05-07T20:31:46.7646172Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7646253Z 2025-05-07T20:31:46.7646352Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7646489Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7646582Z x = x_sign * x_clamp 2025-05-07T20:31:46.7646665Z x0 = x[:, :D] 2025-05-07T20:31:46.7646757Z x1 = x[:, D:] 2025-05-07T20:31:46.7646832Z 2025-05-07T20:31:46.7646921Z if contiguous: 2025-05-07T20:31:46.7647024Z x0 = x0.contiguous() 2025-05-07T20:31:46.7647115Z x1 = x1.contiguous() 2025-05-07T20:31:46.7647198Z 2025-05-07T20:31:46.7647293Z if scale_ub is not None: 2025-05-07T20:31:46.7647403Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7647548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7647626Z ) 2025-05-07T20:31:46.7647706Z else: 2025-05-07T20:31:46.7647812Z scale_ub_tensor = None 2025-05-07T20:31:46.7648013Z 2025-05-07T20:31:46.7648147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7648325Z op = silu_mul_quant 2025-05-07T20:31:46.7648416Z if compiled: 2025-05-07T20:31:46.7648522Z op = torch.compile(op) 2025-05-07T20:31:46.7648641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7648721Z 2025-05-07T20:31:46.7648827Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7648832Z 2025-05-07T20:31:46.7648934Z moe/activation_test.py:117: 2025-05-07T20:31:46.7649068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7649179Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7649282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7649653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7649757Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7650255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7650369Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7650732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7650959Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7651308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7651405Z kernel = self.compile( 2025-05-07T20:31:46.7651791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7651977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7652105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7652116Z 2025-05-07T20:31:46.7652333Z self = 2025-05-07T20:31:46.7653109Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7653625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54501360>} 2025-05-07T20:31:46.7654369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7654561Z context = 2025-05-07T20:31:46.7654566Z 2025-05-07T20:31:46.7654744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7655016Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7655132Z module_map=module_map) 2025-05-07T20:31:46.7655297Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7655398Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7655485Z E ^ 2025-05-07T20:31:46.7655838Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7655843Z 2025-05-07T20:31:46.7656256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7656267Z 2025-05-07T20:31:46.7656375Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7656597Z self=, 2025-05-07T20:31:46.7656682Z T=4096, 2025-05-07T20:31:46.7656850Z D=5120, 2025-05-07T20:31:46.7656935Z scale_ub=None, 2025-05-07T20:31:46.7657029Z contiguous=False, 2025-05-07T20:31:46.7657188Z compiled=True, 2025-05-07T20:31:46.7657267Z ) 2025-05-07T20:31:46.7657491Z self = 2025-05-07T20:31:46.7657666Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7657671Z 2025-05-07T20:31:46.7657752Z @given( 2025-05-07T20:31:46.7657878Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7657980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7658105Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7658223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7658342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7658426Z ) 2025-05-07T20:31:46.7658675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7658778Z def test_silu_mul_quant( 2025-05-07T20:31:46.7658865Z self, 2025-05-07T20:31:46.7658945Z T: int, 2025-05-07T20:31:46.7659030Z D: int, 2025-05-07T20:31:46.7659137Z scale_ub: Optional[float], 2025-05-07T20:31:46.7659230Z contiguous: bool, 2025-05-07T20:31:46.7659327Z compiled: bool, 2025-05-07T20:31:46.7659409Z ) -> None: 2025-05-07T20:31:46.7659508Z torch.manual_seed(2025) 2025-05-07T20:31:46.7659590Z 2025-05-07T20:31:46.7659759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7659938Z 2025-05-07T20:31:46.7660039Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7660166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7660257Z x = x_sign * x_clamp 2025-05-07T20:31:46.7660346Z x0 = x[:, :D] 2025-05-07T20:31:46.7660429Z x1 = x[:, D:] 2025-05-07T20:31:46.7660510Z 2025-05-07T20:31:46.7660605Z if contiguous: 2025-05-07T20:31:46.7660704Z x0 = x0.contiguous() 2025-05-07T20:31:46.7660796Z x1 = x1.contiguous() 2025-05-07T20:31:46.7660885Z 2025-05-07T20:31:46.7660978Z if scale_ub is not None: 2025-05-07T20:31:46.7661092Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7661229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7661308Z ) 2025-05-07T20:31:46.7661392Z else: 2025-05-07T20:31:46.7661491Z scale_ub_tensor = None 2025-05-07T20:31:46.7661567Z 2025-05-07T20:31:46.7661703Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7661798Z op = silu_mul_quant 2025-05-07T20:31:46.7661886Z if compiled: 2025-05-07T20:31:46.7661994Z op = torch.compile(op) 2025-05-07T20:31:46.7662102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7662178Z 2025-05-07T20:31:46.7662280Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7662288Z 2025-05-07T20:31:46.7662391Z moe/activation_test.py:117: 2025-05-07T20:31:46.7662533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7662638Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7662740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7663113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7663214Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7663706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7663816Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7664181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7664411Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7664851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7665027Z kernel = self.compile( 2025-05-07T20:31:46.7665423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7665603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7665739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7665744Z 2025-05-07T20:31:46.7665952Z self = 2025-05-07T20:31:46.7666722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7667228Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54501ea0>} 2025-05-07T20:31:46.7667978Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7668175Z context = 2025-05-07T20:31:46.7668180Z 2025-05-07T20:31:46.7668349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7668612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7668728Z module_map=module_map) 2025-05-07T20:31:46.7668894Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7669002Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7669084Z E ^ 2025-05-07T20:31:46.7669436Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7669445Z 2025-05-07T20:31:46.7669881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7669886Z 2025-05-07T20:31:46.7669993Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7670226Z self=, 2025-05-07T20:31:46.7670307Z T=4096, 2025-05-07T20:31:46.7670388Z D=5120, 2025-05-07T20:31:46.7670481Z scale_ub=1200.0, 2025-05-07T20:31:46.7670572Z contiguous=False, 2025-05-07T20:31:46.7670663Z compiled=False, 2025-05-07T20:31:46.7670746Z ) 2025-05-07T20:31:46.7670965Z self = 2025-05-07T20:31:46.7671143Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7671148Z 2025-05-07T20:31:46.7671242Z @given( 2025-05-07T20:31:46.7671364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7671480Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7671598Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7671717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7671838Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7671916Z ) 2025-05-07T20:31:46.7672162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7672270Z def test_silu_mul_quant( 2025-05-07T20:31:46.7672350Z self, 2025-05-07T20:31:46.7672429Z T: int, 2025-05-07T20:31:46.7672515Z D: int, 2025-05-07T20:31:46.7672616Z scale_ub: Optional[float], 2025-05-07T20:31:46.7672709Z contiguous: bool, 2025-05-07T20:31:46.7672803Z compiled: bool, 2025-05-07T20:31:46.7672884Z ) -> None: 2025-05-07T20:31:46.7672986Z torch.manual_seed(2025) 2025-05-07T20:31:46.7673160Z 2025-05-07T20:31:46.7673342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7673594Z 2025-05-07T20:31:46.7673692Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7673826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7673918Z x = x_sign * x_clamp 2025-05-07T20:31:46.7674001Z x0 = x[:, :D] 2025-05-07T20:31:46.7674091Z x1 = x[:, D:] 2025-05-07T20:31:46.7674167Z 2025-05-07T20:31:46.7674256Z if contiguous: 2025-05-07T20:31:46.7674362Z x0 = x0.contiguous() 2025-05-07T20:31:46.7674456Z x1 = x1.contiguous() 2025-05-07T20:31:46.7674534Z 2025-05-07T20:31:46.7674637Z if scale_ub is not None: 2025-05-07T20:31:46.7674745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7674883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7674972Z ) 2025-05-07T20:31:46.7675053Z else: 2025-05-07T20:31:46.7675165Z scale_ub_tensor = None 2025-05-07T20:31:46.7675243Z 2025-05-07T20:31:46.7675378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7675476Z op = silu_mul_quant 2025-05-07T20:31:46.7675565Z if compiled: 2025-05-07T20:31:46.7675668Z op = torch.compile(op) 2025-05-07T20:31:46.7675784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7675860Z 2025-05-07T20:31:46.7675957Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7675961Z 2025-05-07T20:31:46.7676069Z moe/activation_test.py:117: 2025-05-07T20:31:46.7676203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7676315Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7676419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7676915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:46.7677030Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7677392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7677620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7677967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7678063Z kernel = self.compile( 2025-05-07T20:31:46.7678451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7678628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7678759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7678763Z 2025-05-07T20:31:46.7678974Z self = 2025-05-07T20:31:46.7679753Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7680260Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54502680>} 2025-05-07T20:31:46.7681000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7681193Z context = 2025-05-07T20:31:46.7681204Z 2025-05-07T20:31:46.7681373Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7681635Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7681868Z module_map=module_map) 2025-05-07T20:31:46.7682107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7682211Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7682298Z E ^ 2025-05-07T20:31:46.7682649Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7682654Z 2025-05-07T20:31:46.7683075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7683080Z 2025-05-07T20:31:46.7683189Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7683411Z self=, 2025-05-07T20:31:46.7683501Z T=4096, 2025-05-07T20:31:46.7683579Z D=5120, 2025-05-07T20:31:46.7683666Z scale_ub=1200.0, 2025-05-07T20:31:46.7683763Z contiguous=False, 2025-05-07T20:31:46.7683855Z compiled=True, 2025-05-07T20:31:46.7683931Z ) 2025-05-07T20:31:46.7684159Z self = 2025-05-07T20:31:46.7684337Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7684342Z 2025-05-07T20:31:46.7684428Z @given( 2025-05-07T20:31:46.7684550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7684652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7684776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7684895Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7685012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7685097Z ) 2025-05-07T20:31:46.7685346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7685448Z def test_silu_mul_quant( 2025-05-07T20:31:46.7685531Z self, 2025-05-07T20:31:46.7685616Z T: int, 2025-05-07T20:31:46.7685702Z D: int, 2025-05-07T20:31:46.7685806Z scale_ub: Optional[float], 2025-05-07T20:31:46.7685904Z contiguous: bool, 2025-05-07T20:31:46.7686000Z compiled: bool, 2025-05-07T20:31:46.7686081Z ) -> None: 2025-05-07T20:31:46.7686178Z torch.manual_seed(2025) 2025-05-07T20:31:46.7686263Z 2025-05-07T20:31:46.7686431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7686509Z 2025-05-07T20:31:46.7686611Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7686737Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7686830Z x = x_sign * x_clamp 2025-05-07T20:31:46.7686919Z x0 = x[:, :D] 2025-05-07T20:31:46.7687003Z x1 = x[:, D:] 2025-05-07T20:31:46.7687086Z 2025-05-07T20:31:46.7687174Z if contiguous: 2025-05-07T20:31:46.7687271Z x0 = x0.contiguous() 2025-05-07T20:31:46.7687372Z x1 = x1.contiguous() 2025-05-07T20:31:46.7687453Z 2025-05-07T20:31:46.7687547Z if scale_ub is not None: 2025-05-07T20:31:46.7687668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7687806Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7687885Z ) 2025-05-07T20:31:46.7687969Z else: 2025-05-07T20:31:46.7688068Z scale_ub_tensor = None 2025-05-07T20:31:46.7688145Z 2025-05-07T20:31:46.7688282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7688375Z op = silu_mul_quant 2025-05-07T20:31:46.7688471Z if compiled: 2025-05-07T20:31:46.7688573Z op = torch.compile(op) 2025-05-07T20:31:46.7688682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7688767Z 2025-05-07T20:31:46.7688864Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7688868Z 2025-05-07T20:31:46.7688970Z moe/activation_test.py:117: 2025-05-07T20:31:46.7689110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7689319Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7689494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7690905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7691213Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1c54503ac0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1c55826200>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
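Every failure in this run is the same compile-time rejection: Triton refuses to lower the `fp8e4nv` (FP8 E4M3) dtype on this GPU. The `linux.g5.4xlarge.nvidia.gpu` runner carries an NVIDIA A10G (compute capability 8.6), while Triton's `fp8e4nv` generally requires compute capability 8.9 or newer (Ada/Hopper); on SM 8.6 only `fp8e4b15` and `fp8e5` are available, exactly as the ValueError says. Below is a minimal sketch of a capability gate that would skip these cases on such hardware; the helper name, the 8.9 threshold, and the decorator placement are illustrative assumptions, not FBGEMM's actual code:

```python
# A minimal sketch (not from the log) of gating fp8 tests on device capability.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv kernels need an NVIDIA GPU with compute capability
    # >= 8.9 (e.g. L4, L40S, H100). The A10G behind this runner is SM 8.6,
    # so this helper would return False there.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(
    not supports_fp8e4nv(),
    "Triton fp8e4nv is unsupported on this GPU architecture",
)
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant and friends would live here
```

Gated this way, the suite would report a skip on A10G runners instead of re-raising the identical CompilationError for every Hypothesis example.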
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7710786Z 2025-05-07T20:31:46.7711204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7711208Z 2025-05-07T20:31:46.7711318Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7711539Z self=, 2025-05-07T20:31:46.7711622Z T=1, 2025-05-07T20:31:46.7711700Z D=7168, 2025-05-07T20:31:46.7711781Z scale_ub=None, 2025-05-07T20:31:46.7711877Z contiguous=True, 2025-05-07T20:31:46.7711964Z compiled=False, 2025-05-07T20:31:46.7712039Z ) 2025-05-07T20:31:46.7712263Z self = 2025-05-07T20:31:46.7712432Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.7712443Z 2025-05-07T20:31:46.7712523Z @given( 2025-05-07T20:31:46.7712656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7712756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7712880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7712998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7713111Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7713195Z ) 2025-05-07T20:31:46.7713439Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7713556Z def test_silu_mul_quant( 2025-05-07T20:31:46.7713631Z self, 2025-05-07T20:31:46.7713705Z T: int, 2025-05-07T20:31:46.7713790Z D: int, 2025-05-07T20:31:46.7713888Z scale_ub: Optional[float], 2025-05-07T20:31:46.7719963Z contiguous: bool, 2025-05-07T20:31:46.7720083Z compiled: bool, 2025-05-07T20:31:46.7720170Z ) -> None: 2025-05-07T20:31:46.7720281Z torch.manual_seed(2025) 2025-05-07T20:31:46.7720368Z 2025-05-07T20:31:46.7720550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7720632Z 2025-05-07T20:31:46.7720741Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7720872Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7720967Z x = x_sign * x_clamp 2025-05-07T20:31:46.7721058Z x0 = x[:, :D] 2025-05-07T20:31:46.7721141Z x1 = x[:, D:] 2025-05-07T20:31:46.7721226Z 2025-05-07T20:31:46.7721315Z if contiguous: 2025-05-07T20:31:46.7721413Z x0 = x0.contiguous() 2025-05-07T20:31:46.7721515Z x1 = x1.contiguous() 2025-05-07T20:31:46.7721591Z 2025-05-07T20:31:46.7721685Z if scale_ub is not None: 2025-05-07T20:31:46.7721914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7722055Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7722214Z ) 2025-05-07T20:31:46.7722306Z else: 2025-05-07T20:31:46.7722405Z scale_ub_tensor = None 2025-05-07T20:31:46.7722482Z 2025-05-07T20:31:46.7722625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7722721Z op = silu_mul_quant 2025-05-07T20:31:46.7722810Z if compiled: 2025-05-07T20:31:46.7722924Z op = torch.compile(op) 2025-05-07T20:31:46.7723034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7723120Z 2025-05-07T20:31:46.7723216Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7723221Z 2025-05-07T20:31:46.7723324Z moe/activation_test.py:117: 2025-05-07T20:31:46.7723470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7723576Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7723687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7724209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7724314Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7724685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7724914Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7725265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7725376Z kernel = self.compile( 2025-05-07T20:31:46.7725768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7725946Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7726086Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7726095Z 2025-05-07T20:31:46.7726310Z self = 2025-05-07T20:31:46.7727149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7727652Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c484c0>} 2025-05-07T20:31:46.7728407Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7728602Z context = 2025-05-07T20:31:46.7728612Z 2025-05-07T20:31:46.7728781Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7729063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7729175Z module_map=module_map) 2025-05-07T20:31:46.7729349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7729450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7729531Z E ^ 2025-05-07T20:31:46.7729895Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7729900Z 2025-05-07T20:31:46.7730313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7730318Z 2025-05-07T20:31:46.7730425Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7730655Z self=, 2025-05-07T20:31:46.7730822Z T=16384, 2025-05-07T20:31:46.7730909Z D=7168, 2025-05-07T20:31:46.7731097Z scale_ub=1200.0, 2025-05-07T20:31:46.7731188Z contiguous=False, 2025-05-07T20:31:46.7731284Z compiled=True, 2025-05-07T20:31:46.7731361Z ) 2025-05-07T20:31:46.7731580Z self = 2025-05-07T20:31:46.7731778Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7731783Z 2025-05-07T20:31:46.7731863Z @given( 2025-05-07T20:31:46.7731985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7732101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7732221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7732345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7732459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7732538Z ) 2025-05-07T20:31:46.7732805Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7732900Z def test_silu_mul_quant( 2025-05-07T20:31:46.7732984Z self, 2025-05-07T20:31:46.7733071Z T: int, 2025-05-07T20:31:46.7733151Z D: int, 2025-05-07T20:31:46.7733253Z scale_ub: Optional[float], 2025-05-07T20:31:46.7733354Z contiguous: bool, 2025-05-07T20:31:46.7733444Z compiled: bool, 2025-05-07T20:31:46.7733525Z ) -> None: 2025-05-07T20:31:46.7733631Z torch.manual_seed(2025) 2025-05-07T20:31:46.7733706Z 2025-05-07T20:31:46.7733889Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7733969Z 2025-05-07T20:31:46.7734069Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7734202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7734295Z x = x_sign * x_clamp 2025-05-07T20:31:46.7734378Z x0 = x[:, :D] 2025-05-07T20:31:46.7734469Z x1 = x[:, D:] 2025-05-07T20:31:46.7734552Z 2025-05-07T20:31:46.7734639Z if contiguous: 2025-05-07T20:31:46.7734753Z x0 = x0.contiguous() 2025-05-07T20:31:46.7734846Z x1 = x1.contiguous() 2025-05-07T20:31:46.7734921Z 2025-05-07T20:31:46.7735024Z if scale_ub is not None: 2025-05-07T20:31:46.7735135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7735283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7735363Z ) 2025-05-07T20:31:46.7735446Z else: 2025-05-07T20:31:46.7735552Z scale_ub_tensor = None 2025-05-07T20:31:46.7735628Z 2025-05-07T20:31:46.7735760Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7735860Z op = silu_mul_quant 2025-05-07T20:31:46.7735948Z if compiled: 2025-05-07T20:31:46.7736053Z op = torch.compile(op) 2025-05-07T20:31:46.7736172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7736252Z 2025-05-07T20:31:46.7736347Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7736352Z 2025-05-07T20:31:46.7736472Z moe/activation_test.py:117: 2025-05-07T20:31:46.7736603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7736717Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7736821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7737197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7737304Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7737795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7737897Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7738264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7738488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7739004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7739104Z kernel = self.compile( 2025-05-07T20:31:46.7739494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7739687Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7739912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7739918Z 2025-05-07T20:31:46.7740132Z self = 2025-05-07T20:31:46.7740907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7741421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c495a0>} 2025-05-07T20:31:46.7742169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7742359Z context = 2025-05-07T20:31:46.7742364Z 2025-05-07T20:31:46.7742545Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7742811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7742917Z module_map=module_map) 2025-05-07T20:31:46.7743092Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7743193Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7743281Z E ^ 2025-05-07T20:31:46.7743635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7743640Z 2025-05-07T20:31:46.7744061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7744066Z 2025-05-07T20:31:46.7744169Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7744398Z self=, 2025-05-07T20:31:46.7744472Z T=1, 2025-05-07T20:31:46.7744550Z D=7168, 2025-05-07T20:31:46.7744641Z scale_ub=None, 2025-05-07T20:31:46.7744732Z contiguous=False, 2025-05-07T20:31:46.7744815Z compiled=False, 2025-05-07T20:31:46.7744895Z ) 2025-05-07T20:31:46.7745108Z self = 2025-05-07T20:31:46.7745282Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.7745292Z 2025-05-07T20:31:46.7745369Z @given( 2025-05-07T20:31:46.7745493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7745598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7745711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7745825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7745947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7746020Z ) 2025-05-07T20:31:46.7746270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7746370Z def test_silu_mul_quant( 2025-05-07T20:31:46.7746446Z self, 2025-05-07T20:31:46.7746528Z T: int, 2025-05-07T20:31:46.7746606Z D: int, 2025-05-07T20:31:46.7746705Z scale_ub: Optional[float], 2025-05-07T20:31:46.7746803Z contiguous: bool, 2025-05-07T20:31:46.7746889Z compiled: bool, 2025-05-07T20:31:46.7747056Z ) -> None: 2025-05-07T20:31:46.7747157Z torch.manual_seed(2025) 2025-05-07T20:31:46.7747232Z 2025-05-07T20:31:46.7747472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7747554Z 2025-05-07T20:31:46.7747648Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7747771Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7747869Z x = x_sign * x_clamp 2025-05-07T20:31:46.7747952Z x0 = x[:, :D] 2025-05-07T20:31:46.7748038Z x1 = x[:, D:] 2025-05-07T20:31:46.7748111Z 2025-05-07T20:31:46.7748195Z if contiguous: 2025-05-07T20:31:46.7748294Z x0 = x0.contiguous() 2025-05-07T20:31:46.7748383Z x1 = x1.contiguous() 2025-05-07T20:31:46.7748459Z 2025-05-07T20:31:46.7748559Z if scale_ub is not None: 2025-05-07T20:31:46.7748665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7748800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7748887Z ) 2025-05-07T20:31:46.7748964Z else: 2025-05-07T20:31:46.7749058Z scale_ub_tensor = None 2025-05-07T20:31:46.7749145Z 2025-05-07T20:31:46.7749274Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7749364Z op = silu_mul_quant 2025-05-07T20:31:46.7749459Z if compiled: 2025-05-07T20:31:46.7749563Z op = torch.compile(op) 2025-05-07T20:31:46.7749679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7749753Z 2025-05-07T20:31:46.7749844Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7749849Z 2025-05-07T20:31:46.7749953Z moe/activation_test.py:117: 2025-05-07T20:31:46.7750082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7750185Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7750291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7750783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7750901Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7751264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7751486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7751839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7751935Z kernel = self.compile( 2025-05-07T20:31:46.7752315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7752499Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7752627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7752631Z 2025-05-07T20:31:46.7752851Z self = 2025-05-07T20:31:46.7753628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7754134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c49d80>} 2025-05-07T20:31:46.7754869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7755060Z context = 2025-05-07T20:31:46.7755065Z 2025-05-07T20:31:46.7755243Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7755588Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7755770Z module_map=module_map) 2025-05-07T20:31:46.7755935Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7756035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7756119Z E ^ 2025-05-07T20:31:46.7756473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7756478Z 2025-05-07T20:31:46.7756944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7756957Z 2025-05-07T20:31:46.7757059Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7757278Z self=, 2025-05-07T20:31:46.7757360Z T=2048, 2025-05-07T20:31:46.7757438Z D=7168, 2025-05-07T20:31:46.7757524Z scale_ub=None, 2025-05-07T20:31:46.7757620Z contiguous=False, 2025-05-07T20:31:46.7757700Z compiled=True, 2025-05-07T20:31:46.7757779Z ) 2025-05-07T20:31:46.7757999Z self = 2025-05-07T20:31:46.7758174Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7758178Z 2025-05-07T20:31:46.7758255Z @given( 2025-05-07T20:31:46.7758382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7758479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7758600Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7758716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7758830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7758908Z ) 2025-05-07T20:31:46.7759156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7759249Z def test_silu_mul_quant( 2025-05-07T20:31:46.7759340Z self, 2025-05-07T20:31:46.7759415Z T: int, 2025-05-07T20:31:46.7759489Z D: int, 2025-05-07T20:31:46.7759601Z scale_ub: Optional[float], 2025-05-07T20:31:46.7759692Z contiguous: bool, 2025-05-07T20:31:46.7759777Z compiled: bool, 2025-05-07T20:31:46.7759861Z ) -> None: 2025-05-07T20:31:46.7759952Z torch.manual_seed(2025) 2025-05-07T20:31:46.7760030Z 2025-05-07T20:31:46.7760196Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7760270Z 2025-05-07T20:31:46.7760369Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7760491Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7760583Z x = x_sign * x_clamp 2025-05-07T20:31:46.7760670Z x0 = x[:, :D] 2025-05-07T20:31:46.7760746Z x1 = x[:, D:] 2025-05-07T20:31:46.7760818Z 2025-05-07T20:31:46.7760907Z if contiguous: 2025-05-07T20:31:46.7760999Z x0 = x0.contiguous() 2025-05-07T20:31:46.7761091Z x1 = x1.contiguous() 2025-05-07T20:31:46.7761172Z 2025-05-07T20:31:46.7761269Z if scale_ub is not None: 2025-05-07T20:31:46.7761380Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7761514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7761589Z ) 2025-05-07T20:31:46.7761669Z else: 2025-05-07T20:31:46.7761764Z scale_ub_tensor = None 2025-05-07T20:31:46.7761837Z 2025-05-07T20:31:46.7761969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7762058Z op = silu_mul_quant 2025-05-07T20:31:46.7762143Z if compiled: 2025-05-07T20:31:46.7762257Z op = torch.compile(op) 2025-05-07T20:31:46.7762363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7762440Z 2025-05-07T20:31:46.7762538Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7762542Z 2025-05-07T20:31:46.7762749Z moe/activation_test.py:117: 2025-05-07T20:31:46.7762890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7763067Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7763170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7763548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7763643Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7764134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7764240Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7764594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7764826Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7765171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7765272Z kernel = self.compile( 2025-05-07T20:31:46.7765665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7765843Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7765967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7765978Z 2025-05-07T20:31:46.7766183Z self = 2025-05-07T20:31:46.7766966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7767480Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c4af80>} 2025-05-07T20:31:46.7768239Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7768436Z context = 2025-05-07T20:31:46.7768441Z 2025-05-07T20:31:46.7768608Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7768867Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7768984Z module_map=module_map) 2025-05-07T20:31:46.7769146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7769253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7769329Z E ^ 2025-05-07T20:31:46.7769685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7769696Z 2025-05-07T20:31:46.7770128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7770133Z 2025-05-07T20:31:46.7770236Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7770465Z self=, 2025-05-07T20:31:46.7770541Z T=4096, 2025-05-07T20:31:46.7770615Z D=7168, 2025-05-07T20:31:46.7770701Z scale_ub=None, 2025-05-07T20:31:46.7770786Z contiguous=False, 2025-05-07T20:31:46.7770869Z compiled=True, 2025-05-07T20:31:46.7770951Z ) 2025-05-07T20:31:46.7771166Z self = 2025-05-07T20:31:46.7771337Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7771342Z 2025-05-07T20:31:46.7771425Z @given( 2025-05-07T20:31:46.7771542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7771725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7771922Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7772042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7772159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7772232Z ) 2025-05-07T20:31:46.7772480Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7772578Z def test_silu_mul_quant( 2025-05-07T20:31:46.7772657Z self, 2025-05-07T20:31:46.7772736Z T: int, 2025-05-07T20:31:46.7772816Z D: int, 2025-05-07T20:31:46.7772913Z scale_ub: Optional[float], 2025-05-07T20:31:46.7773003Z contiguous: bool, 2025-05-07T20:31:46.7773096Z compiled: bool, 2025-05-07T20:31:46.7773174Z ) -> None: 2025-05-07T20:31:46.7773267Z torch.manual_seed(2025) 2025-05-07T20:31:46.7773349Z 2025-05-07T20:31:46.7773524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7773605Z 2025-05-07T20:31:46.7773705Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7773828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7773925Z x = x_sign * x_clamp 2025-05-07T20:31:46.7774003Z x0 = x[:, :D] 2025-05-07T20:31:46.7774083Z x1 = x[:, D:] 2025-05-07T20:31:46.7774159Z 2025-05-07T20:31:46.7774241Z if contiguous: 2025-05-07T20:31:46.7774332Z x0 = x0.contiguous() 2025-05-07T20:31:46.7774426Z x1 = x1.contiguous() 2025-05-07T20:31:46.7774498Z 2025-05-07T20:31:46.7774590Z if scale_ub is not None: 2025-05-07T20:31:46.7774698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7774832Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7774916Z ) 2025-05-07T20:31:46.7774993Z else: 2025-05-07T20:31:46.7775087Z scale_ub_tensor = None 2025-05-07T20:31:46.7775172Z 2025-05-07T20:31:46.7775301Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7775396Z op = silu_mul_quant 2025-05-07T20:31:46.7775488Z if compiled: 2025-05-07T20:31:46.7775588Z op = torch.compile(op) 2025-05-07T20:31:46.7775694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7775770Z 2025-05-07T20:31:46.7775861Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7775865Z 2025-05-07T20:31:46.7775963Z moe/activation_test.py:117: 2025-05-07T20:31:46.7776098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7776198Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7776304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7776669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7776761Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7777272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7777371Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7777726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7777958Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7778298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7778400Z kernel = self.compile( 2025-05-07T20:31:46.7778778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7778952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7779084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7779178Z 2025-05-07T20:31:46.7779383Z self = 2025-05-07T20:31:46.7780288Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7780797Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54c4be20>} 2025-05-07T20:31:46.7781542Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7781730Z context = 2025-05-07T20:31:46.7781735Z 2025-05-07T20:31:46.7781898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7782181Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7782288Z module_map=module_map) 2025-05-07T20:31:46.7782451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7782558Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7782633Z E ^ 2025-05-07T20:31:46.7782998Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7783003Z 2025-05-07T20:31:46.7783413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7783418Z 2025-05-07T20:31:46.7783524Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7783751Z self=, 2025-05-07T20:31:46.7783836Z T=16384, 2025-05-07T20:31:46.7783914Z D=5120, 2025-05-07T20:31:46.7784005Z scale_ub=1200.0, 2025-05-07T20:31:46.7784091Z contiguous=False, 2025-05-07T20:31:46.7784186Z compiled=False, 2025-05-07T20:31:46.7784259Z ) 2025-05-07T20:31:46.7784471Z self = 2025-05-07T20:31:46.7784657Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.7784662Z 2025-05-07T20:31:46.7784741Z @given( 2025-05-07T20:31:46.7784857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7784960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7785075Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7785191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7785311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7785385Z ) 2025-05-07T20:31:46.7785638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7785736Z def test_silu_mul_quant( 2025-05-07T20:31:46.7785812Z self, 2025-05-07T20:31:46.7785897Z T: int, 2025-05-07T20:31:46.7785972Z D: int, 2025-05-07T20:31:46.7786073Z scale_ub: Optional[float], 2025-05-07T20:31:46.7786169Z contiguous: bool, 2025-05-07T20:31:46.7786253Z compiled: bool, 2025-05-07T20:31:46.7786331Z ) -> None: 2025-05-07T20:31:46.7786427Z torch.manual_seed(2025) 2025-05-07T20:31:46.7786499Z 2025-05-07T20:31:46.7786665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7786743Z 2025-05-07T20:31:46.7786835Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7786967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7787058Z x = x_sign * x_clamp 2025-05-07T20:31:46.7787139Z x0 = x[:, :D] 2025-05-07T20:31:46.7787224Z x1 = x[:, D:] 2025-05-07T20:31:46.7787297Z 2025-05-07T20:31:46.7787464Z if contiguous: 2025-05-07T20:31:46.7787563Z x0 = x0.contiguous() 2025-05-07T20:31:46.7787724Z x1 = x1.contiguous() 2025-05-07T20:31:46.7787800Z 2025-05-07T20:31:46.7787897Z if scale_ub is not None: 2025-05-07T20:31:46.7788000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7788134Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7788215Z ) 2025-05-07T20:31:46.7788293Z else: 2025-05-07T20:31:46.7788395Z scale_ub_tensor = None 2025-05-07T20:31:46.7788465Z 2025-05-07T20:31:46.7788596Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7788689Z op = silu_mul_quant 2025-05-07T20:31:46.7788774Z if compiled: 2025-05-07T20:31:46.7788874Z op = torch.compile(op) 2025-05-07T20:31:46.7788984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7789054Z 2025-05-07T20:31:46.7789145Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7789153Z 2025-05-07T20:31:46.7789259Z moe/activation_test.py:117: 2025-05-07T20:31:46.7789393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7789502Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7789601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7791374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:46.7791506Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7791948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7792207Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7792628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7792727Z kernel = self.compile( 2025-05-07T20:31:46.7793133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7793311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7793438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7793444Z 2025-05-07T20:31:46.7793655Z self = 2025-05-07T20:31:46.7794423Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7794924Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b517e0>} 2025-05-07T20:31:46.7795676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7795869Z context = 2025-05-07T20:31:46.7795881Z 2025-05-07T20:31:46.7796047Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7796311Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7796428Z module_map=module_map) 2025-05-07T20:31:46.7796589Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7796689Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7796772Z E ^ 2025-05-07T20:31:46.7797121Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7797126Z 2025-05-07T20:31:46.7797544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7797779Z 2025-05-07T20:31:46.7797991Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7798214Z self=, 2025-05-07T20:31:46.7798303Z T=16384, 2025-05-07T20:31:46.7798377Z D=5120, 2025-05-07T20:31:46.7798462Z scale_ub=1200.0, 2025-05-07T20:31:46.7798553Z contiguous=True, 2025-05-07T20:31:46.7798636Z compiled=True, 2025-05-07T20:31:46.7798711Z ) 2025-05-07T20:31:46.7798932Z self = 2025-05-07T20:31:46.7799108Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.7799113Z 2025-05-07T20:31:46.7799200Z @given( 2025-05-07T20:31:46.7799319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7799420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7799547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7799664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7799783Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7799866Z ) 2025-05-07T20:31:46.7800120Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7800214Z def test_silu_mul_quant( 2025-05-07T20:31:46.7800299Z self, 2025-05-07T20:31:46.7800375Z T: int, 2025-05-07T20:31:46.7800457Z D: int, 2025-05-07T20:31:46.7800556Z scale_ub: Optional[float], 2025-05-07T20:31:46.7800645Z contiguous: bool, 2025-05-07T20:31:46.7800735Z compiled: bool, 2025-05-07T20:31:46.7800816Z ) -> None: 2025-05-07T20:31:46.7800909Z torch.manual_seed(2025) 2025-05-07T20:31:46.7800988Z 2025-05-07T20:31:46.7801157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7801232Z 2025-05-07T20:31:46.7801335Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7801459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7801551Z x = x_sign * x_clamp 2025-05-07T20:31:46.7801637Z x0 = x[:, :D] 2025-05-07T20:31:46.7801717Z x1 = x[:, D:] 2025-05-07T20:31:46.7801798Z 2025-05-07T20:31:46.7801882Z if contiguous: 2025-05-07T20:31:46.7801974Z x0 = x0.contiguous() 2025-05-07T20:31:46.7802070Z x1 = x1.contiguous() 2025-05-07T20:31:46.7802145Z 2025-05-07T20:31:46.7802234Z if scale_ub is not None: 2025-05-07T20:31:46.7802344Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7802477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7802553Z ) 2025-05-07T20:31:46.7802638Z else: 2025-05-07T20:31:46.7802732Z scale_ub_tensor = None 2025-05-07T20:31:46.7802808Z 2025-05-07T20:31:46.7802937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7803033Z op = silu_mul_quant 2025-05-07T20:31:46.7803124Z if compiled: 2025-05-07T20:31:46.7803227Z op = torch.compile(op) 2025-05-07T20:31:46.7803332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7803410Z 2025-05-07T20:31:46.7803503Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7803507Z 2025-05-07T20:31:46.7803610Z moe/activation_test.py:117: 2025-05-07T20:31:46.7803738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7803839Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7803943Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7804306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7804401Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7804907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7805096Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7805537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7805763Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7806106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7806208Z kernel = self.compile( 2025-05-07T20:31:46.7806586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7806765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7806893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7806898Z 2025-05-07T20:31:46.7807101Z self = 2025-05-07T20:31:46.7807903Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7808404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b51090>} 2025-05-07T20:31:46.7809147Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7809336Z context = 2025-05-07T20:31:46.7809341Z 2025-05-07T20:31:46.7809506Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7809773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7809886Z module_map=module_map) 2025-05-07T20:31:46.7810059Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7810156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7810231Z E ^ 2025-05-07T20:31:46.7810589Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7810594Z 2025-05-07T20:31:46.7811005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7811010Z 2025-05-07T20:31:46.7811124Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7811344Z self=, 2025-05-07T20:31:46.7811427Z T=16384, 2025-05-07T20:31:46.7811513Z D=5120, 2025-05-07T20:31:46.7811594Z scale_ub=None, 2025-05-07T20:31:46.7811686Z contiguous=False, 2025-05-07T20:31:46.7811776Z compiled=True, 2025-05-07T20:31:46.7811849Z ) 2025-05-07T20:31:46.7812069Z self = 2025-05-07T20:31:46.7812251Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7812255Z 2025-05-07T20:31:46.7812329Z @given( 2025-05-07T20:31:46.7812451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7812551Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7812662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7812782Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7812896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7812973Z ) 2025-05-07T20:31:46.7813226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7813322Z def test_silu_mul_quant( 2025-05-07T20:31:46.7813398Z self, 2025-05-07T20:31:46.7813567Z T: int, 2025-05-07T20:31:46.7813642Z D: int, 2025-05-07T20:31:46.7813740Z scale_ub: Optional[float], 2025-05-07T20:31:46.7813907Z contiguous: bool, 2025-05-07T20:31:46.7813997Z compiled: bool, 2025-05-07T20:31:46.7814079Z ) -> None: 2025-05-07T20:31:46.7814172Z torch.manual_seed(2025) 2025-05-07T20:31:46.7814247Z 2025-05-07T20:31:46.7814424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7814500Z 2025-05-07T20:31:46.7814592Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7814721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7814812Z x = x_sign * x_clamp 2025-05-07T20:31:46.7814892Z x0 = x[:, :D] 2025-05-07T20:31:46.7814977Z x1 = x[:, D:] 2025-05-07T20:31:46.7815052Z 2025-05-07T20:31:46.7815134Z if contiguous: 2025-05-07T20:31:46.7815232Z x0 = x0.contiguous() 2025-05-07T20:31:46.7815328Z x1 = x1.contiguous() 2025-05-07T20:31:46.7815401Z 2025-05-07T20:31:46.7815498Z if scale_ub is not None: 2025-05-07T20:31:46.7815610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7815748Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7815821Z ) 2025-05-07T20:31:46.7815898Z else: 2025-05-07T20:31:46.7815998Z scale_ub_tensor = None 2025-05-07T20:31:46.7816072Z 2025-05-07T20:31:46.7816203Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7816301Z op = silu_mul_quant 2025-05-07T20:31:46.7816390Z if compiled: 2025-05-07T20:31:46.7816501Z op = torch.compile(op) 2025-05-07T20:31:46.7816630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7816718Z 2025-05-07T20:31:46.7816821Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7816830Z 2025-05-07T20:31:46.7816928Z moe/activation_test.py:117: 2025-05-07T20:31:46.7817061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7817171Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7817276Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7817643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7817747Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7818234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7818337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7818694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7818923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7819270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7819373Z kernel = self.compile( 2025-05-07T20:31:46.7819765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7820049Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7820178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7820183Z 2025-05-07T20:31:46.7820400Z self = 2025-05-07T20:31:46.7821167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7821672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b52290>} 2025-05-07T20:31:46.7822620Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7822812Z context = 2025-05-07T20:31:46.7822817Z 2025-05-07T20:31:46.7822990Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7823253Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7823359Z module_map=module_map) 2025-05-07T20:31:46.7823527Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7823627Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7823711Z E ^ 2025-05-07T20:31:46.7824058Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7824069Z 2025-05-07T20:31:46.7824484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7824488Z 2025-05-07T20:31:46.7824600Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7824819Z self=, 2025-05-07T20:31:46.7824900Z T=2048, 2025-05-07T20:31:46.7824974Z D=5120, 2025-05-07T20:31:46.7825053Z scale_ub=None, 2025-05-07T20:31:46.7825142Z contiguous=False, 2025-05-07T20:31:46.7825223Z compiled=True, 2025-05-07T20:31:46.7825298Z ) 2025-05-07T20:31:46.7825518Z self = 2025-05-07T20:31:46.7825688Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.7825693Z 2025-05-07T20:31:46.7825767Z @given( 2025-05-07T20:31:46.7825891Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7825993Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7826114Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7826234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7826346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7826425Z ) 2025-05-07T20:31:46.7826672Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7826766Z def test_silu_mul_quant( 2025-05-07T20:31:46.7826849Z self, 2025-05-07T20:31:46.7826922Z T: int, 2025-05-07T20:31:46.7826995Z D: int, 2025-05-07T20:31:46.7827099Z scale_ub: Optional[float], 2025-05-07T20:31:46.7827185Z contiguous: bool, 2025-05-07T20:31:46.7827269Z compiled: bool, 2025-05-07T20:31:46.7827349Z ) -> None: 2025-05-07T20:31:46.7827442Z torch.manual_seed(2025) 2025-05-07T20:31:46.7827521Z 2025-05-07T20:31:46.7827686Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7827763Z 2025-05-07T20:31:46.7827861Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7827990Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7828079Z x = x_sign * x_clamp 2025-05-07T20:31:46.7828161Z x0 = x[:, :D] 2025-05-07T20:31:46.7828241Z x1 = x[:, D:] 2025-05-07T20:31:46.7828312Z 2025-05-07T20:31:46.7828399Z if contiguous: 2025-05-07T20:31:46.7828490Z x0 = x0.contiguous() 2025-05-07T20:31:46.7828577Z x1 = x1.contiguous() 2025-05-07T20:31:46.7828653Z 2025-05-07T20:31:46.7828742Z if scale_ub is not None: 2025-05-07T20:31:46.7828847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7828989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7829067Z ) 2025-05-07T20:31:46.7829148Z else: 2025-05-07T20:31:46.7829241Z scale_ub_tensor = None 2025-05-07T20:31:46.7829314Z 2025-05-07T20:31:46.7829537Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7829625Z op = silu_mul_quant 2025-05-07T20:31:46.7829783Z if compiled: 2025-05-07T20:31:46.7829891Z op = torch.compile(op) 2025-05-07T20:31:46.7829996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7830064Z 2025-05-07T20:31:46.7830160Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7830164Z 2025-05-07T20:31:46.7830259Z moe/activation_test.py:117: 2025-05-07T20:31:46.7830390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7830490Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7830589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7830961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7831053Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7831539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7831653Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7832010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7832237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7832572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7832666Z kernel = self.compile( 2025-05-07T20:31:46.7833047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7833224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7833352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7833363Z 2025-05-07T20:31:46.7833568Z self = 2025-05-07T20:31:46.7834338Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7834837Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b52170>} 2025-05-07T20:31:46.7835573Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7835765Z context = 2025-05-07T20:31:46.7835770Z 2025-05-07T20:31:46.7835932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7836195Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7836311Z module_map=module_map) 2025-05-07T20:31:46.7836471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7836574Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7836649Z E ^ 2025-05-07T20:31:46.7837004Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.7837009Z 2025-05-07T20:31:46.7837422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.7837426Z 2025-05-07T20:31:46.7837529Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7837747Z self=, 2025-05-07T20:31:46.7837828Z T=2048, 2025-05-07T20:31:46.7837903Z D=5120, 2025-05-07T20:31:46.7838074Z scale_ub=1200.0, 2025-05-07T20:31:46.7838160Z contiguous=False, 2025-05-07T20:31:46.7838243Z compiled=True, 2025-05-07T20:31:46.7838395Z ) 2025-05-07T20:31:46.7838614Z self = 2025-05-07T20:31:46.7838791Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.7838796Z 2025-05-07T20:31:46.7838874Z @given( 2025-05-07T20:31:46.7838991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7839089Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7839206Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7839326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7839445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7839520Z ) 2025-05-07T20:31:46.7839764Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7839867Z def test_silu_mul_quant( 2025-05-07T20:31:46.7839944Z self, 2025-05-07T20:31:46.7840016Z T: int, 2025-05-07T20:31:46.7840097Z D: int, 2025-05-07T20:31:46.7840198Z scale_ub: Optional[float], 2025-05-07T20:31:46.7840287Z contiguous: bool, 2025-05-07T20:31:46.7840377Z compiled: bool, 2025-05-07T20:31:46.7840455Z ) -> None: 2025-05-07T20:31:46.7840547Z torch.manual_seed(2025) 2025-05-07T20:31:46.7840624Z 2025-05-07T20:31:46.7840790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7840868Z 2025-05-07T20:31:46.7840959Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7841078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7841178Z x = x_sign * x_clamp 2025-05-07T20:31:46.7841259Z x0 = x[:, :D] 2025-05-07T20:31:46.7841343Z x1 = x[:, D:] 2025-05-07T20:31:46.7841416Z 2025-05-07T20:31:46.7845879Z if contiguous: 2025-05-07T20:31:46.7846003Z x0 = x0.contiguous() 2025-05-07T20:31:46.7846094Z x1 = x1.contiguous() 2025-05-07T20:31:46.7846169Z 2025-05-07T20:31:46.7846276Z if scale_ub is not None: 2025-05-07T20:31:46.7846388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7846529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7846606Z ) 2025-05-07T20:31:46.7846683Z else: 2025-05-07T20:31:46.7846779Z scale_ub_tensor = None 2025-05-07T20:31:46.7846851Z 2025-05-07T20:31:46.7846983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7847077Z op = silu_mul_quant 2025-05-07T20:31:46.7847167Z if compiled: 2025-05-07T20:31:46.7847266Z op = torch.compile(op) 2025-05-07T20:31:46.7847383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7847452Z 2025-05-07T20:31:46.7847542Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7847548Z 2025-05-07T20:31:46.7847657Z moe/activation_test.py:117: 2025-05-07T20:31:46.7847789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7847895Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7848000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7848374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.7848469Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.7848964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7849063Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7849421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7849638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7849974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7850185Z kernel = self.compile( 2025-05-07T20:31:46.7850636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7850818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7850943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7850948Z 2025-05-07T20:31:46.7851152Z self = 2025-05-07T20:31:46.7851927Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.7852424Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54b53880>} 2025-05-07T20:31:46.7853181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.7853371Z context = 2025-05-07T20:31:46.7853376Z 2025-05-07T20:31:46.7853539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.7853804Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.7853913Z module_map=module_map) 2025-05-07T20:31:46.7854080Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.7854176Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.7854253Z E ^ 2025-05-07T20:31:46.7854609Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then retried further examples; each of the following reached the kernel launch and failed at _fbgemm_silu_mul_quant compilation with the same error (examples with compiled=True additionally route through torch/_dynamo/eval_frame.py before reaching silu_mul_quant):

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

Every one of these attempts ended with:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
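To iterate on a single failing configuration without replaying the whole Hypothesis search, the @example decorator can pin a concrete parameter set so it is always run in addition to the sampled ones. A standalone sketch; check_params and its body are placeholders, not the real test:

    # Hypothetical repro snippet, not from the FBGEMM repo.
    from typing import Optional

    from hypothesis import example, given, settings
    import hypothesis.strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=2048, D=5120, scale_ub=1200.0)  # first failing example above
    @settings(deadline=None, max_examples=5)
    def check_params(T: int, D: int, scale_ub: Optional[float]) -> None:
        # The real test would build x0/x1 here and call silu_mul_quant.
        assert T >= 1 and D in (5120, 7168)

    check_params()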
Several subsequent retries then failed during input-tensor setup with CUDA out-of-memory errors, before the kernel was reached:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has 32.44 MiB free; this process has 22.03 GiB in use, of which 21.61 GiB is allocated by PyTorch and 136.52 MiB is reserved but unallocated. (Same allocator guidance as above.)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has 144.44 MiB free; this process has 21.92 GiB in use, of which 21.50 GiB is allocated by PyTorch and 136.52 MiB is reserved but unallocated. (Same allocator guidance as above.)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has 32.44 MiB free; this process has 22.03 GiB in use, of which 21.67 GiB is allocated by PyTorch and 80.52 MiB is reserved but unallocated. (Same allocator guidance as above.)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has 32.44 MiB free; this process has 22.03 GiB in use, of which 21.67 GiB is allocated by PyTorch and 80.52 MiB is reserved but unallocated. (Same allocator guidance as above.)
moe/activation_test.py:94: OutOfMemoryError
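These retries fail while allocating the [T, 2 * D] bfloat16 input and its elementwise temporaries, because earlier examples have left the 22.07 GiB device nearly full. The allocator's own suggestion from the messages above, plus explicitly returning cached blocks between examples, is sketched below; both are assumptions about how one might harden the test, not code from the suite:

    import os

    # Option 1 (from the error message): reduce fragmentation via expandable
    # segments. Must be set before CUDA is first initialized in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc

    import torch

    def release_cached_memory() -> None:
        # Option 2: drop dead references, then hand cached blocks back to the
        # driver so the next example's large inputs can be allocated.
        gc.collect()
        torch.cuda.empty_cache()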
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.7990188Z 2025-05-07T20:31:46.7990332Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:46.7990337Z 2025-05-07T20:31:46.7990442Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.7990660Z self=, 2025-05-07T20:31:46.7990736Z T=1, 2025-05-07T20:31:46.7990815Z D=7168, 2025-05-07T20:31:46.7990897Z scale_ub=1200.0, 2025-05-07T20:31:46.7990980Z contiguous=True, 2025-05-07T20:31:46.7991065Z compiled=False, 2025-05-07T20:31:46.7991137Z ) 2025-05-07T20:31:46.7991350Z self = 2025-05-07T20:31:46.7991668Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.7991673Z 2025-05-07T20:31:46.7991852Z @given( 2025-05-07T20:31:46.7991977Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.7992075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.7992186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.7992307Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.7992421Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.7992506Z ) 2025-05-07T20:31:46.7992747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.7992840Z def test_silu_mul_quant( 2025-05-07T20:31:46.7992918Z self, 2025-05-07T20:31:46.7992992Z T: int, 2025-05-07T20:31:46.7993072Z D: int, 2025-05-07T20:31:46.7993168Z scale_ub: Optional[float], 2025-05-07T20:31:46.7993260Z contiguous: bool, 2025-05-07T20:31:46.7993352Z compiled: bool, 2025-05-07T20:31:46.7993430Z ) -> None: 2025-05-07T20:31:46.7993532Z torch.manual_seed(2025) 2025-05-07T20:31:46.7993606Z 2025-05-07T20:31:46.7993768Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.7993844Z 2025-05-07T20:31:46.7993936Z x_sign = torch.sign(x) 2025-05-07T20:31:46.7994058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.7994149Z x = x_sign * x_clamp 2025-05-07T20:31:46.7994229Z x0 = x[:, :D] 2025-05-07T20:31:46.7994307Z x1 = x[:, D:] 2025-05-07T20:31:46.7994380Z 2025-05-07T20:31:46.7994465Z if contiguous: 2025-05-07T20:31:46.7994557Z x0 = x0.contiguous() 2025-05-07T20:31:46.7994645Z x1 = x1.contiguous() 2025-05-07T20:31:46.7994715Z 2025-05-07T20:31:46.7994810Z if scale_ub is not None: 2025-05-07T20:31:46.7994913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.7995052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.7995132Z ) 2025-05-07T20:31:46.7995211Z else: 2025-05-07T20:31:46.7995304Z scale_ub_tensor = None 2025-05-07T20:31:46.7995380Z 2025-05-07T20:31:46.7995507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.7995595Z op = silu_mul_quant 2025-05-07T20:31:46.7995685Z if compiled: 2025-05-07T20:31:46.7995785Z op = torch.compile(op) 2025-05-07T20:31:46.7995891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7995966Z 2025-05-07T20:31:46.7996055Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.7996060Z 2025-05-07T20:31:46.7996161Z moe/activation_test.py:117: 2025-05-07T20:31:46.7996291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7996391Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.7996492Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.7997001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.7997101Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.7997467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.7997688Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.7998034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.7998129Z kernel = self.compile( 2025-05-07T20:31:46.7998507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.7998684Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.7998811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.7998900Z 2025-05-07T20:31:46.7999110Z self = 2025-05-07T20:31:46.7999979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8000484Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fc13b50>} 2025-05-07T20:31:46.8001230Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8001422Z context = 2025-05-07T20:31:46.8001426Z 2025-05-07T20:31:46.8001599Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8001863Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8001969Z module_map=module_map) 2025-05-07T20:31:46.8002136Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8002233Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8002311Z E ^ 2025-05-07T20:31:46.8002664Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8002669Z 2025-05-07T20:31:46.8003077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8003082Z 2025-05-07T20:31:46.8003187Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8003405Z self=, 2025-05-07T20:31:46.8003491Z T=128, 2025-05-07T20:31:46.8003566Z D=5120, 2025-05-07T20:31:46.8003647Z scale_ub=None, 2025-05-07T20:31:46.8003735Z contiguous=True, 2025-05-07T20:31:46.8003820Z compiled=False, 2025-05-07T20:31:46.8003893Z ) 2025-05-07T20:31:46.8004111Z self = 2025-05-07T20:31:46.8004279Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8004284Z 2025-05-07T20:31:46.8004359Z @given( 2025-05-07T20:31:46.8004479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8004577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8004692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8004810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8004922Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8004998Z ) 2025-05-07T20:31:46.8005244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8005342Z def test_silu_mul_quant( 2025-05-07T20:31:46.8005420Z self, 2025-05-07T20:31:46.8005501Z T: int, 2025-05-07T20:31:46.8005577Z D: int, 2025-05-07T20:31:46.8005680Z scale_ub: Optional[float], 2025-05-07T20:31:46.8005768Z contiguous: bool, 2025-05-07T20:31:46.8005856Z compiled: bool, 2025-05-07T20:31:46.8005939Z ) -> None: 2025-05-07T20:31:46.8006033Z torch.manual_seed(2025) 2025-05-07T20:31:46.8006108Z 2025-05-07T20:31:46.8006281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8006356Z 2025-05-07T20:31:46.8006450Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8006572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8006661Z x = x_sign * x_clamp 2025-05-07T20:31:46.8006745Z x0 = x[:, :D] 2025-05-07T20:31:46.8006824Z x1 = x[:, D:] 2025-05-07T20:31:46.8006895Z 2025-05-07T20:31:46.8007069Z if contiguous: 2025-05-07T20:31:46.8007160Z x0 = x0.contiguous() 2025-05-07T20:31:46.8007324Z x1 = x1.contiguous() 2025-05-07T20:31:46.8007401Z 2025-05-07T20:31:46.8007493Z if scale_ub is not None: 2025-05-07T20:31:46.8007596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8007734Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8007811Z ) 2025-05-07T20:31:46.8007890Z else: 2025-05-07T20:31:46.8007985Z scale_ub_tensor = None 2025-05-07T20:31:46.8008056Z 2025-05-07T20:31:46.8008190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8008279Z op = silu_mul_quant 2025-05-07T20:31:46.8008364Z if compiled: 2025-05-07T20:31:46.8008466Z op = torch.compile(op) 2025-05-07T20:31:46.8008569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8008641Z 2025-05-07T20:31:46.8008733Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8008744Z 2025-05-07T20:31:46.8008842Z moe/activation_test.py:117: 2025-05-07T20:31:46.8008973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8009078Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8009174Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8009670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8009767Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8010121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8010345Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8010685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8010781Z kernel = self.compile( 2025-05-07T20:31:46.8011169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8011348Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8011480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8011484Z 2025-05-07T20:31:46.8011687Z self = 2025-05-07T20:31:46.8012459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8012956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa54670>} 2025-05-07T20:31:46.8013697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8013896Z context = 2025-05-07T20:31:46.8013900Z 2025-05-07T20:31:46.8014066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8014327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8014433Z module_map=module_map) 2025-05-07T20:31:46.8014593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8014695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8014774Z E ^ 2025-05-07T20:31:46.8015129Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8015137Z 2025-05-07T20:31:46.8015632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8015636Z 2025-05-07T20:31:46.8015810Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8016032Z self=, 2025-05-07T20:31:46.8016109Z T=128, 2025-05-07T20:31:46.8016185Z D=7168, 2025-05-07T20:31:46.8016271Z scale_ub=None, 2025-05-07T20:31:46.8016355Z contiguous=True, 2025-05-07T20:31:46.8016439Z compiled=False, 2025-05-07T20:31:46.8016515Z ) 2025-05-07T20:31:46.8016726Z self = 2025-05-07T20:31:46.8016898Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8016902Z 2025-05-07T20:31:46.8016981Z @given( 2025-05-07T20:31:46.8017100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8017203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8017325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8017439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8017560Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8017634Z ) 2025-05-07T20:31:46.8017881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8017977Z def test_silu_mul_quant( 2025-05-07T20:31:46.8018053Z self, 2025-05-07T20:31:46.8018132Z T: int, 2025-05-07T20:31:46.8018206Z D: int, 2025-05-07T20:31:46.8018304Z scale_ub: Optional[float], 2025-05-07T20:31:46.8018395Z contiguous: bool, 2025-05-07T20:31:46.8018481Z compiled: bool, 2025-05-07T20:31:46.8018560Z ) -> None: 2025-05-07T20:31:46.8018657Z torch.manual_seed(2025) 2025-05-07T20:31:46.8018727Z 2025-05-07T20:31:46.8018893Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8018969Z 2025-05-07T20:31:46.8019065Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8019188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8019287Z x = x_sign * x_clamp 2025-05-07T20:31:46.8019368Z x0 = x[:, :D] 2025-05-07T20:31:46.8019449Z x1 = x[:, D:] 2025-05-07T20:31:46.8019523Z 2025-05-07T20:31:46.8019605Z if contiguous: 2025-05-07T20:31:46.8019700Z x0 = x0.contiguous() 2025-05-07T20:31:46.8019788Z x1 = x1.contiguous() 2025-05-07T20:31:46.8019924Z 2025-05-07T20:31:46.8020018Z if scale_ub is not None: 2025-05-07T20:31:46.8020119Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8020252Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8020329Z ) 2025-05-07T20:31:46.8020407Z else: 2025-05-07T20:31:46.8020501Z scale_ub_tensor = None 2025-05-07T20:31:46.8020574Z 2025-05-07T20:31:46.8020700Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8020794Z op = silu_mul_quant 2025-05-07T20:31:46.8020880Z if compiled: 2025-05-07T20:31:46.8020985Z op = torch.compile(op) 2025-05-07T20:31:46.8021093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8021163Z 2025-05-07T20:31:46.8021252Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8021257Z 2025-05-07T20:31:46.8021354Z moe/activation_test.py:117: 2025-05-07T20:31:46.8021480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8021579Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8021679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8022172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8022272Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8022631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8022939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8023355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8023450Z kernel = self.compile( 2025-05-07T20:31:46.8023833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8024007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8024132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8024137Z 2025-05-07T20:31:46.8024345Z self = 2025-05-07T20:31:46.8025110Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8025616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa54ee0>} 2025-05-07T20:31:46.8026360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8026549Z context = 2025-05-07T20:31:46.8026554Z 2025-05-07T20:31:46.8026721Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8026979Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8027087Z module_map=module_map) 2025-05-07T20:31:46.8027249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8027355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8027434Z E ^ 2025-05-07T20:31:46.8027795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8027801Z 2025-05-07T20:31:46.8028217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8028222Z 2025-05-07T20:31:46.8028327Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8028545Z self=, 2025-05-07T20:31:46.8028626Z T=2048, 2025-05-07T20:31:46.8028704Z D=7168, 2025-05-07T20:31:46.8028788Z scale_ub=1200.0, 2025-05-07T20:31:46.8028876Z contiguous=True, 2025-05-07T20:31:46.8028958Z compiled=False, 2025-05-07T20:31:46.8029031Z ) 2025-05-07T20:31:46.8029248Z self = 2025-05-07T20:31:46.8029425Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8029430Z 2025-05-07T20:31:46.8029508Z @given( 2025-05-07T20:31:46.8029629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8029724Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8029841Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8029958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8030071Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8030147Z ) 2025-05-07T20:31:46.8030392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8030486Z def test_silu_mul_quant( 2025-05-07T20:31:46.8030570Z self, 2025-05-07T20:31:46.8030645Z T: int, 2025-05-07T20:31:46.8030718Z D: int, 2025-05-07T20:31:46.8030819Z scale_ub: Optional[float], 2025-05-07T20:31:46.8030908Z contiguous: bool, 2025-05-07T20:31:46.8031102Z compiled: bool, 2025-05-07T20:31:46.8031183Z ) -> None: 2025-05-07T20:31:46.8031352Z torch.manual_seed(2025) 2025-05-07T20:31:46.8031430Z 2025-05-07T20:31:46.8031599Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8033364Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
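The failing request sizes match the test's first allocation exactly: a [T, 2*D] bfloat16 tensor occupies T * 2D * 2 bytes, so the examples that die on line 92 fail on the very first allocation, before any kernel runs. A quick check against the example above:

    # Sanity check of the quoted request size for T=2048, D=7168.
    T, D = 2048, 7168
    bytes_needed = T * (2 * D) * 2   # bfloat16 is 2 bytes per element
    print(bytes_needed / 2**20)      # 56.0 -> matches "Tried to allocate 56.00 MiB"

The same arithmetic reproduces the other sizes in this log (40 MiB for T=2048/D=5120, 320 MiB for T=16384/D=5120, 448 MiB for T=16384/D=7168).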
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8033374Z 2025-05-07T20:31:46.8033491Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8033504Z 2025-05-07T20:31:46.8033607Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8033835Z self=, 2025-05-07T20:31:46.8033911Z T=1, 2025-05-07T20:31:46.8033985Z D=5120, 2025-05-07T20:31:46.8034072Z scale_ub=1200.0, 2025-05-07T20:31:46.8034155Z contiguous=True, 2025-05-07T20:31:46.8034238Z compiled=False, 2025-05-07T20:31:46.8034314Z ) 2025-05-07T20:31:46.8034526Z self = 2025-05-07T20:31:46.8034693Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8034697Z 2025-05-07T20:31:46.8034774Z @given( 2025-05-07T20:31:46.8034889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8034991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8035106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8035220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8035336Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8035409Z ) 2025-05-07T20:31:46.8035659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8035756Z def test_silu_mul_quant( 2025-05-07T20:31:46.8035831Z self, 2025-05-07T20:31:46.8035908Z T: int, 2025-05-07T20:31:46.8035983Z D: int, 2025-05-07T20:31:46.8036081Z scale_ub: Optional[float], 2025-05-07T20:31:46.8036172Z contiguous: bool, 2025-05-07T20:31:46.8036256Z compiled: bool, 2025-05-07T20:31:46.8036331Z ) -> None: 2025-05-07T20:31:46.8036426Z torch.manual_seed(2025) 2025-05-07T20:31:46.8036498Z 2025-05-07T20:31:46.8036663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8036739Z 2025-05-07T20:31:46.8036829Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8036953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8037050Z x = x_sign * x_clamp 2025-05-07T20:31:46.8037129Z x0 = x[:, :D] 2025-05-07T20:31:46.8037217Z x1 = x[:, D:] 2025-05-07T20:31:46.8037287Z 2025-05-07T20:31:46.8037369Z if contiguous: 2025-05-07T20:31:46.8037465Z x0 = x0.contiguous() 2025-05-07T20:31:46.8037551Z x1 = x1.contiguous() 2025-05-07T20:31:46.8037624Z 2025-05-07T20:31:46.8037716Z if scale_ub is not None: 2025-05-07T20:31:46.8037819Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8037951Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8038029Z ) 2025-05-07T20:31:46.8038103Z else: 2025-05-07T20:31:46.8038196Z scale_ub_tensor = None 2025-05-07T20:31:46.8038271Z 2025-05-07T20:31:46.8038396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8038486Z op = silu_mul_quant 2025-05-07T20:31:46.8038570Z if compiled: 2025-05-07T20:31:46.8038752Z op = torch.compile(op) 2025-05-07T20:31:46.8038862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8039007Z 2025-05-07T20:31:46.8039099Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8039104Z 2025-05-07T20:31:46.8039204Z moe/activation_test.py:117: 2025-05-07T20:31:46.8039329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8039430Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8039533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8040028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8040132Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8040485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8040703Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8041057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8041153Z kernel = self.compile( 2025-05-07T20:31:46.8041531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8041708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8041834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8041838Z 2025-05-07T20:31:46.8042047Z self = 2025-05-07T20:31:46.8042818Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8043325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c4fa55e10>} 2025-05-07T20:31:46.8044081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8044273Z context = 2025-05-07T20:31:46.8044278Z 2025-05-07T20:31:46.8044446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8044710Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8044821Z module_map=module_map) 2025-05-07T20:31:46.8044981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8045077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8045155Z E ^ 2025-05-07T20:31:46.8045507Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8045516Z 2025-05-07T20:31:46.8045925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8045934Z 2025-05-07T20:31:46.8046037Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8046256Z self=, 2025-05-07T20:31:46.8046335Z T=2048, 2025-05-07T20:31:46.8046409Z D=5120, 2025-05-07T20:31:46.8046488Z scale_ub=None, 2025-05-07T20:31:46.8046574Z contiguous=True, 2025-05-07T20:31:46.8046655Z compiled=False, 2025-05-07T20:31:46.8046727Z ) 2025-05-07T20:31:46.8046942Z self = 2025-05-07T20:31:46.8047113Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8047202Z 2025-05-07T20:31:46.8047280Z @given( 2025-05-07T20:31:46.8047399Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8047570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8047685Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8047800Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8047913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8047992Z ) 2025-05-07T20:31:46.8048231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8048324Z def test_silu_mul_quant( 2025-05-07T20:31:46.8048402Z self, 2025-05-07T20:31:46.8048477Z T: int, 2025-05-07T20:31:46.8048550Z D: int, 2025-05-07T20:31:46.8048651Z scale_ub: Optional[float], 2025-05-07T20:31:46.8048739Z contiguous: bool, 2025-05-07T20:31:46.8048825Z compiled: bool, 2025-05-07T20:31:46.8048901Z ) -> None: 2025-05-07T20:31:46.8049000Z torch.manual_seed(2025) 2025-05-07T20:31:46.8049073Z 2025-05-07T20:31:46.8049246Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8049320Z 2025-05-07T20:31:46.8049414Z > x_sign = torch.sign(x) 2025-05-07T20:31:46.8051175Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
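Every example that survives allocation then hits the same Triton CompilationError seen above: fp8e4nv (the e4m3 format behind torch.float8_e4m3fn) is not available on this runner's GPU, a g5 instance's A10G at compute capability 8.6, and Triton only offers fp8e4b15/fp8e5 there. A hedged skip guard one could add, assuming pytest-style markers; the (8, 9) threshold reflects Ada/Hopper support for e4m3:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) kernels need Ada (SM 8.9) or Hopper (SM 9.0) and newer.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8 = pytest.mark.skipif(
        not supports_fp8e4nv(), reason="Triton fp8e4nv unsupported on this GPU"
    )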
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8051181Z 2025-05-07T20:31:46.8051302Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:46.8051307Z 2025-05-07T20:31:46.8051413Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8051630Z self=, 2025-05-07T20:31:46.8051712Z T=16384, 2025-05-07T20:31:46.8051789Z D=5120, 2025-05-07T20:31:46.8051878Z scale_ub=None, 2025-05-07T20:31:46.8051961Z contiguous=True, 2025-05-07T20:31:46.8052043Z compiled=False, 2025-05-07T20:31:46.8052116Z ) 2025-05-07T20:31:46.8052327Z self = 2025-05-07T20:31:46.8052503Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8052507Z 2025-05-07T20:31:46.8052583Z @given( 2025-05-07T20:31:46.8052700Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8052801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8052911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8053028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8053147Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8053220Z ) 2025-05-07T20:31:46.8053474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8053567Z def test_silu_mul_quant( 2025-05-07T20:31:46.8053642Z self, 2025-05-07T20:31:46.8053718Z T: int, 2025-05-07T20:31:46.8053792Z D: int, 2025-05-07T20:31:46.8053889Z scale_ub: Optional[float], 2025-05-07T20:31:46.8053978Z contiguous: bool, 2025-05-07T20:31:46.8054062Z compiled: bool, 2025-05-07T20:31:46.8054138Z ) -> None: 2025-05-07T20:31:46.8054239Z torch.manual_seed(2025) 2025-05-07T20:31:46.8054312Z 2025-05-07T20:31:46.8054478Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8056333Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
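For readers without the FBGEMM source at hand: the op under test composes a SiLU-gated multiply with fp8 quantization. The sketch below is only an assumed eager-mode equivalent, inferred from the call signature op(x0, x1, scale_ub_tensor) and the (y_fp8, y_scale) return; the actual kernel lives in fbgemm_gpu/experimental/gen_ai/moe/activation.py and may differ in details such as scale granularity.

    import torch

    def silu_mul_quant_reference(x0, x1, scale_ub=None):
        # Assumed semantics: y = silu(x0) * x1, then rowwise e4m3 quantization.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # scale_ub is a [1] tensor
        scale = row_max / torch.finfo(torch.float8_e4m3fn).max  # e4m3 max = 448.0
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale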
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8056487Z 2025-05-07T20:31:46.8056606Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8056613Z 2025-05-07T20:31:46.8056714Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8056929Z self=, 2025-05-07T20:31:46.8057007Z T=4096, 2025-05-07T20:31:46.8057082Z D=5120, 2025-05-07T20:31:46.8057167Z scale_ub=None, 2025-05-07T20:31:46.8057254Z contiguous=True, 2025-05-07T20:31:46.8057336Z compiled=False, 2025-05-07T20:31:46.8057407Z ) 2025-05-07T20:31:46.8057618Z self = 2025-05-07T20:31:46.8057795Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8057806Z 2025-05-07T20:31:46.8057885Z @given( 2025-05-07T20:31:46.8058002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8058098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8058216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8058331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8058444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8058522Z ) 2025-05-07T20:31:46.8058765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8058857Z def test_silu_mul_quant( 2025-05-07T20:31:46.8058936Z self, 2025-05-07T20:31:46.8059010Z T: int, 2025-05-07T20:31:46.8059084Z D: int, 2025-05-07T20:31:46.8059184Z scale_ub: Optional[float], 2025-05-07T20:31:46.8059277Z contiguous: bool, 2025-05-07T20:31:46.8059366Z compiled: bool, 2025-05-07T20:31:46.8059443Z ) -> None: 2025-05-07T20:31:46.8059541Z torch.manual_seed(2025) 2025-05-07T20:31:46.8059620Z 2025-05-07T20:31:46.8059783Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8061591Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8061599Z 2025-05-07T20:31:46.8061718Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8061728Z 2025-05-07T20:31:46.8061830Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8062054Z self=, 2025-05-07T20:31:46.8062131Z T=2048, 2025-05-07T20:31:46.8062206Z D=5120, 2025-05-07T20:31:46.8062294Z scale_ub=None, 2025-05-07T20:31:46.8062380Z contiguous=False, 2025-05-07T20:31:46.8062462Z compiled=False, 2025-05-07T20:31:46.8062538Z ) 2025-05-07T20:31:46.8062748Z self = 2025-05-07T20:31:46.8062919Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.8062924Z 2025-05-07T20:31:46.8062999Z @given( 2025-05-07T20:31:46.8063114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8063212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8063325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8063552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8063668Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8063814Z ) 2025-05-07T20:31:46.8064064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8064157Z def test_silu_mul_quant( 2025-05-07T20:31:46.8064231Z self, 2025-05-07T20:31:46.8064306Z T: int, 2025-05-07T20:31:46.8064379Z D: int, 2025-05-07T20:31:46.8064476Z scale_ub: Optional[float], 2025-05-07T20:31:46.8064566Z contiguous: bool, 2025-05-07T20:31:46.8064650Z compiled: bool, 2025-05-07T20:31:46.8064725Z ) -> None: 2025-05-07T20:31:46.8064821Z torch.manual_seed(2025) 2025-05-07T20:31:46.8064894Z 2025-05-07T20:31:46.8065059Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8066871Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
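The contiguous parameter in these examples matters because x0 = x[:, :D] and x1 = x[:, D:] are strided views into one buffer, while .contiguous() materializes dense copies, so the kernel is exercised with both layouts. A small standalone check of that behavior:

    import torch

    x = torch.randn(4, 8)
    x0 = x[:, :4]                           # column slice: a view with strides (8, 1)
    print(x0.is_contiguous())               # False
    print(x0.contiguous().is_contiguous())  # True, backed by a fresh dense copy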
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8066882Z 2025-05-07T20:31:46.8067000Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8067007Z 2025-05-07T20:31:46.8067111Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8067330Z self=, 2025-05-07T20:31:46.8067410Z T=4096, 2025-05-07T20:31:46.8067485Z D=7168, 2025-05-07T20:31:46.8067570Z scale_ub=None, 2025-05-07T20:31:46.8067654Z contiguous=True, 2025-05-07T20:31:46.8067736Z compiled=True, 2025-05-07T20:31:46.8067813Z ) 2025-05-07T20:31:46.8068029Z self = 2025-05-07T20:31:46.8068201Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:46.8068206Z 2025-05-07T20:31:46.8068280Z @given( 2025-05-07T20:31:46.8068396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8068490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8068603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8068716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8068826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8068901Z ) 2025-05-07T20:31:46.8069147Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8069239Z def test_silu_mul_quant( 2025-05-07T20:31:46.8069319Z self, 2025-05-07T20:31:46.8069393Z T: int, 2025-05-07T20:31:46.8069467Z D: int, 2025-05-07T20:31:46.8069571Z scale_ub: Optional[float], 2025-05-07T20:31:46.8069659Z contiguous: bool, 2025-05-07T20:31:46.8069750Z compiled: bool, 2025-05-07T20:31:46.8069826Z ) -> None: 2025-05-07T20:31:46.8069919Z torch.manual_seed(2025) 2025-05-07T20:31:46.8069995Z 2025-05-07T20:31:46.8070158Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8071915Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8072010Z 2025-05-07T20:31:46.8072128Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8072132Z 2025-05-07T20:31:46.8072306Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8072531Z self=, 2025-05-07T20:31:46.8072606Z T=2048, 2025-05-07T20:31:46.8072682Z D=5120, 2025-05-07T20:31:46.8072767Z scale_ub=1200.0, 2025-05-07T20:31:46.8072850Z contiguous=False, 2025-05-07T20:31:46.8072932Z compiled=False, 2025-05-07T20:31:46.8073008Z ) 2025-05-07T20:31:46.8073219Z self = 2025-05-07T20:31:46.8073393Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.8073398Z 2025-05-07T20:31:46.8073473Z @given( 2025-05-07T20:31:46.8073588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8073686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8073805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8073919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8074039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8074112Z ) 2025-05-07T20:31:46.8074361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8074454Z def test_silu_mul_quant( 2025-05-07T20:31:46.8074527Z self, 2025-05-07T20:31:46.8074603Z T: int, 2025-05-07T20:31:46.8074677Z D: int, 2025-05-07T20:31:46.8074775Z scale_ub: Optional[float], 2025-05-07T20:31:46.8074867Z contiguous: bool, 2025-05-07T20:31:46.8074951Z compiled: bool, 2025-05-07T20:31:46.8075028Z ) -> None: 2025-05-07T20:31:46.8075125Z torch.manual_seed(2025) 2025-05-07T20:31:46.8075196Z 2025-05-07T20:31:46.8075358Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8077120Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
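The figures quoted in these messages (device free memory, bytes allocated by PyTorch, reserved-but-unallocated cache) can be reproduced directly with the standard CUDA memory introspection calls; a small sketch:

    import torch

    free, total = torch.cuda.mem_get_info()    # device-wide free/total bytes
    allocated = torch.cuda.memory_allocated()  # live tensor bytes held by PyTorch
    reserved = torch.cuda.memory_reserved()    # bytes cached by the allocator
    print(f"free {free / 2**20:.2f} MiB of {total / 2**30:.2f} GiB total")
    print(f"allocated {allocated / 2**30:.2f} GiB, reserved-unallocated "
          f"{(reserved - allocated) / 2**20:.2f} MiB")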
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8077131Z 2025-05-07T20:31:46.8077247Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8077254Z 2025-05-07T20:31:46.8077354Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8077571Z self=, 2025-05-07T20:31:46.8077650Z T=4096, 2025-05-07T20:31:46.8077724Z D=7168, 2025-05-07T20:31:46.8077805Z scale_ub=1200.0, 2025-05-07T20:31:46.8077896Z contiguous=True, 2025-05-07T20:31:46.8077978Z compiled=False, 2025-05-07T20:31:46.8078049Z ) 2025-05-07T20:31:46.8078271Z self = 2025-05-07T20:31:46.8078442Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8078446Z 2025-05-07T20:31:46.8078524Z @given( 2025-05-07T20:31:46.8078642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8078739Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8078853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8078966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8079078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8079154Z ) 2025-05-07T20:31:46.8079398Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8079490Z def test_silu_mul_quant( 2025-05-07T20:31:46.8079566Z self, 2025-05-07T20:31:46.8079726Z T: int, 2025-05-07T20:31:46.8079801Z D: int, 2025-05-07T20:31:46.8079901Z scale_ub: Optional[float], 2025-05-07T20:31:46.8080064Z contiguous: bool, 2025-05-07T20:31:46.8080158Z compiled: bool, 2025-05-07T20:31:46.8080234Z ) -> None: 2025-05-07T20:31:46.8080326Z torch.manual_seed(2025) 2025-05-07T20:31:46.8080401Z 2025-05-07T20:31:46.8080566Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8082318Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8082331Z 2025-05-07T20:31:46.8082452Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8082457Z 2025-05-07T20:31:46.8082557Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8082778Z self=, 2025-05-07T20:31:46.8082854Z T=16384, 2025-05-07T20:31:46.8082930Z D=7168, 2025-05-07T20:31:46.8083020Z scale_ub=None, 2025-05-07T20:31:46.8083105Z contiguous=False, 2025-05-07T20:31:46.8083187Z compiled=True, 2025-05-07T20:31:46.8083261Z ) 2025-05-07T20:31:46.8083471Z self = 2025-05-07T20:31:46.8083652Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.8083657Z 2025-05-07T20:31:46.8083732Z @given( 2025-05-07T20:31:46.8083848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8083954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8084064Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8084183Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8084297Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8084373Z ) 2025-05-07T20:31:46.8084622Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8084714Z def test_silu_mul_quant( 2025-05-07T20:31:46.8084789Z self, 2025-05-07T20:31:46.8084868Z T: int, 2025-05-07T20:31:46.8084941Z D: int, 2025-05-07T20:31:46.8085038Z scale_ub: Optional[float], 2025-05-07T20:31:46.8085129Z contiguous: bool, 2025-05-07T20:31:46.8085214Z compiled: bool, 2025-05-07T20:31:46.8085290Z ) -> None: 2025-05-07T20:31:46.8085388Z torch.manual_seed(2025) 2025-05-07T20:31:46.8085460Z 2025-05-07T20:31:46.8085624Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8087405Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8087410Z 2025-05-07T20:31:46.8087528Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8087535Z 2025-05-07T20:31:46.8087636Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8087853Z self=, 2025-05-07T20:31:46.8087932Z T=4096, 2025-05-07T20:31:46.8088007Z D=7168, 2025-05-07T20:31:46.8088173Z scale_ub=None, 2025-05-07T20:31:46.8088260Z contiguous=True, 2025-05-07T20:31:46.8088347Z compiled=False, 2025-05-07T20:31:46.8088518Z ) 2025-05-07T20:31:46.8088733Z self = 2025-05-07T20:31:46.8088903Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8088907Z 2025-05-07T20:31:46.8088983Z @given( 2025-05-07T20:31:46.8089103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8089200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8089312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8089425Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8089536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8089613Z ) 2025-05-07T20:31:46.8090121Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8090247Z def test_silu_mul_quant( 2025-05-07T20:31:46.8090330Z self, 2025-05-07T20:31:46.8090404Z T: int, 2025-05-07T20:31:46.8090490Z D: int, 2025-05-07T20:31:46.8090588Z scale_ub: Optional[float], 2025-05-07T20:31:46.8090675Z contiguous: bool, 2025-05-07T20:31:46.8094875Z compiled: bool, 2025-05-07T20:31:46.8094973Z ) -> None: 2025-05-07T20:31:46.8095074Z torch.manual_seed(2025) 2025-05-07T20:31:46.8095152Z 2025-05-07T20:31:46.8095333Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8097121Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8097133Z 2025-05-07T20:31:46.8097254Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8097258Z 2025-05-07T20:31:46.8097365Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8097589Z self=, 2025-05-07T20:31:46.8097666Z T=16384, 2025-05-07T20:31:46.8097745Z D=7168, 2025-05-07T20:31:46.8097826Z scale_ub=None, 2025-05-07T20:31:46.8097909Z contiguous=True, 2025-05-07T20:31:46.8097995Z compiled=False, 2025-05-07T20:31:46.8098068Z ) 2025-05-07T20:31:46.8098282Z self = 2025-05-07T20:31:46.8098457Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:46.8098462Z 2025-05-07T20:31:46.8098537Z @given( 2025-05-07T20:31:46.8098662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8098757Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8098876Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8099000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8099112Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8099187Z ) 2025-05-07T20:31:46.8099432Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8099525Z def test_silu_mul_quant( 2025-05-07T20:31:46.8099599Z self, 2025-05-07T20:31:46.8099677Z T: int, 2025-05-07T20:31:46.8099751Z D: int, 2025-05-07T20:31:46.8099917Z scale_ub: Optional[float], 2025-05-07T20:31:46.8100012Z contiguous: bool, 2025-05-07T20:31:46.8100097Z compiled: bool, 2025-05-07T20:31:46.8100180Z ) -> None: 2025-05-07T20:31:46.8100277Z torch.manual_seed(2025) 2025-05-07T20:31:46.8100517Z 2025-05-07T20:31:46.8100700Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8102574Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8102581Z 2025-05-07T20:31:46.8102704Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8102708Z 2025-05-07T20:31:46.8102814Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8103032Z self=, 2025-05-07T20:31:46.8103117Z T=16384, 2025-05-07T20:31:46.8103191Z D=7168, 2025-05-07T20:31:46.8103273Z scale_ub=1200.0, 2025-05-07T20:31:46.8103365Z contiguous=True, 2025-05-07T20:31:46.8103448Z compiled=False, 2025-05-07T20:31:46.8103520Z ) 2025-05-07T20:31:46.8103735Z self = 2025-05-07T20:31:46.8103908Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8103912Z 2025-05-07T20:31:46.8103989Z @given( 2025-05-07T20:31:46.8104104Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8104203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8104323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8104437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8104548Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8104624Z ) 2025-05-07T20:31:46.8104871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8104971Z def test_silu_mul_quant( 2025-05-07T20:31:46.8105050Z self, 2025-05-07T20:31:46.8105126Z T: int, 2025-05-07T20:31:46.8105204Z D: int, 2025-05-07T20:31:46.8105302Z scale_ub: Optional[float], 2025-05-07T20:31:46.8105390Z contiguous: bool, 2025-05-07T20:31:46.8105482Z compiled: bool, 2025-05-07T20:31:46.8105560Z ) -> None: 2025-05-07T20:31:46.8105654Z torch.manual_seed(2025) 2025-05-07T20:31:46.8105734Z 2025-05-07T20:31:46.8105906Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8107678Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
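Note that even the largest request here (448 MiB for T=16384, D=7168) fails with the device already down to ~30 MiB free, i.e., the GPU was effectively full before the example started. A hypothetical fail-fast precondition, not part of this suite, that would surface that state more directly than a cascade of per-example OOMs:

    import torch

    def assert_cuda_headroom(min_free_mib: int = 1024) -> None:
        # Hypothetical guard: abort early when the device is already nearly full.
        free, _ = torch.cuda.mem_get_info()
        if free < min_free_mib * 2**20:
            raise RuntimeError(
                f"cuda:0 has only {free / 2**20:.0f} MiB free; "
                f"need at least {min_free_mib} MiB"
            )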
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8107687Z 2025-05-07T20:31:46.8107804Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8107809Z 2025-05-07T20:31:46.8107911Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8108132Z self=, 2025-05-07T20:31:46.8108209Z T=128, 2025-05-07T20:31:46.8108290Z D=5120, 2025-05-07T20:31:46.8108371Z scale_ub=1200.0, 2025-05-07T20:31:46.8108455Z contiguous=False, 2025-05-07T20:31:46.8108540Z compiled=False, 2025-05-07T20:31:46.8108612Z ) 2025-05-07T20:31:46.8108825Z self = 2025-05-07T20:31:46.8109000Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:46.8109086Z 2025-05-07T20:31:46.8109162Z @given( 2025-05-07T20:31:46.8109285Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8109454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8109568Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8109684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8109799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8109872Z ) 2025-05-07T20:31:46.8110115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8110209Z def test_silu_mul_quant( 2025-05-07T20:31:46.8110283Z self, 2025-05-07T20:31:46.8110361Z T: int, 2025-05-07T20:31:46.8110437Z D: int, 2025-05-07T20:31:46.8110536Z scale_ub: Optional[float], 2025-05-07T20:31:46.8110627Z contiguous: bool, 2025-05-07T20:31:46.8110711Z compiled: bool, 2025-05-07T20:31:46.8110790Z ) -> None: 2025-05-07T20:31:46.8110891Z torch.manual_seed(2025) 2025-05-07T20:31:46.8110962Z 2025-05-07T20:31:46.8111133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8111208Z 2025-05-07T20:31:46.8111301Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8111428Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8111517Z x = x_sign * x_clamp 2025-05-07T20:31:46.8111596Z x0 = x[:, :D] 2025-05-07T20:31:46.8111677Z x1 = x[:, D:] 2025-05-07T20:31:46.8111748Z 2025-05-07T20:31:46.8111831Z if contiguous: 2025-05-07T20:31:46.8111926Z x0 = x0.contiguous() 2025-05-07T20:31:46.8112014Z x1 = x1.contiguous() 2025-05-07T20:31:46.8112086Z 2025-05-07T20:31:46.8112181Z if scale_ub is not None: 2025-05-07T20:31:46.8112284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8112422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8112499Z ) 2025-05-07T20:31:46.8112572Z else: 2025-05-07T20:31:46.8112668Z scale_ub_tensor = None 2025-05-07T20:31:46.8112748Z 2025-05-07T20:31:46.8112874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8112967Z op = silu_mul_quant 2025-05-07T20:31:46.8113051Z if compiled: 2025-05-07T20:31:46.8113150Z op = torch.compile(op) 2025-05-07T20:31:46.8113256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8113327Z 2025-05-07T20:31:46.8113415Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8113422Z 2025-05-07T20:31:46.8113515Z moe/activation_test.py:117: 2025-05-07T20:31:46.8113639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8113741Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8113837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8114332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8114436Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8114793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8115021Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8115358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8115451Z kernel = self.compile( 2025-05-07T20:31:46.8115832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8116005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8116131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8116136Z 2025-05-07T20:31:46.8116343Z self = 2025-05-07T20:31:46.8117335Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8117844Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c54125cf0>} 2025-05-07T20:31:46.8118585Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8118770Z context = 2025-05-07T20:31:46.8118775Z 2025-05-07T20:31:46.8118939Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8119196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8119310Z module_map=module_map) 2025-05-07T20:31:46.8119473Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8119568Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8119644Z E ^ 2025-05-07T20:31:46.8119996Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8120001Z 2025-05-07T20:31:46.8120408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8120413Z 2025-05-07T20:31:46.8120518Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8120734Z self=, 2025-05-07T20:31:46.8120807Z T=2048, 2025-05-07T20:31:46.8120879Z D=7168, 2025-05-07T20:31:46.8120956Z scale_ub=None, 2025-05-07T20:31:46.8121044Z contiguous=False, 2025-05-07T20:31:46.8121126Z compiled=False, 2025-05-07T20:31:46.8121197Z ) 2025-05-07T20:31:46.8121413Z self = 2025-05-07T20:31:46.8121587Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:46.8121592Z 2025-05-07T20:31:46.8121662Z @given( 2025-05-07T20:31:46.8121779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8121874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8121986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8122099Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8122208Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8122277Z ) 2025-05-07T20:31:46.8122520Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8122612Z def test_silu_mul_quant( 2025-05-07T20:31:46.8122693Z self, 2025-05-07T20:31:46.8122771Z T: int, 2025-05-07T20:31:46.8122840Z D: int, 2025-05-07T20:31:46.8122943Z scale_ub: Optional[float], 2025-05-07T20:31:46.8123028Z contiguous: bool, 2025-05-07T20:31:46.8123113Z compiled: bool, 2025-05-07T20:31:46.8123191Z ) -> None: 2025-05-07T20:31:46.8123285Z torch.manual_seed(2025) 2025-05-07T20:31:46.8123354Z 2025-05-07T20:31:46.8123522Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8125284Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8125371Z 2025-05-07T20:31:46.8125586Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:46.8125591Z 2025-05-07T20:31:46.8125692Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8125909Z self=, 2025-05-07T20:31:46.8125989Z T=128, 2025-05-07T20:31:46.8126062Z D=7168, 2025-05-07T20:31:46.8126144Z scale_ub=1200.0, 2025-05-07T20:31:46.8126225Z contiguous=True, 2025-05-07T20:31:46.8126301Z compiled=True, 2025-05-07T20:31:46.8126372Z ) 2025-05-07T20:31:46.8126585Z self = 2025-05-07T20:31:46.8126753Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.8126758Z 2025-05-07T20:31:46.8126835Z @given( 2025-05-07T20:31:46.8126949Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8127050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8127168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8127281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8127394Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8127464Z ) 2025-05-07T20:31:46.8127705Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8127797Z def test_silu_mul_quant( 2025-05-07T20:31:46.8127870Z self, 2025-05-07T20:31:46.8127941Z T: int, 2025-05-07T20:31:46.8128015Z D: int, 2025-05-07T20:31:46.8128111Z scale_ub: Optional[float], 2025-05-07T20:31:46.8128198Z contiguous: bool, 2025-05-07T20:31:46.8128284Z compiled: bool, 2025-05-07T20:31:46.8128359Z ) -> None: 2025-05-07T20:31:46.8128449Z torch.manual_seed(2025) 2025-05-07T20:31:46.8128525Z 2025-05-07T20:31:46.8128688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8128769Z 2025-05-07T20:31:46.8128856Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8128982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8129073Z x = x_sign * x_clamp 2025-05-07T20:31:46.8129148Z x0 = x[:, :D] 2025-05-07T20:31:46.8129225Z x1 = x[:, D:] 2025-05-07T20:31:46.8129297Z 2025-05-07T20:31:46.8129375Z if contiguous: 2025-05-07T20:31:46.8129464Z x0 = x0.contiguous() 2025-05-07T20:31:46.8129553Z x1 = x1.contiguous() 2025-05-07T20:31:46.8129621Z 2025-05-07T20:31:46.8129710Z if scale_ub is not None: 2025-05-07T20:31:46.8129814Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.8129945Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.8130023Z ) 2025-05-07T20:31:46.8130099Z else: 2025-05-07T20:31:46.8130191Z scale_ub_tensor = None 2025-05-07T20:31:46.8130270Z 2025-05-07T20:31:46.8130396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.8130487Z op = silu_mul_quant 2025-05-07T20:31:46.8130571Z if compiled: 2025-05-07T20:31:46.8130666Z op = torch.compile(op) 2025-05-07T20:31:46.8130767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8130839Z 2025-05-07T20:31:46.8130926Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.8130931Z 2025-05-07T20:31:46.8131025Z moe/activation_test.py:117: 2025-05-07T20:31:46.8131154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8131253Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.8131353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.8131718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.8131807Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.8132392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.8132562Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.8132924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.8133145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.8133480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.8133575Z kernel = self.compile( 2025-05-07T20:31:46.8133951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.8134122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.8134249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.8134260Z 2025-05-07T20:31:46.8134462Z self = 2025-05-07T20:31:46.8135236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.8135727Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c541270a0>} 2025-05-07T20:31:46.8136466Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.8136684Z context = 2025-05-07T20:31:46.8136690Z 2025-05-07T20:31:46.8136870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.8137138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.8137241Z module_map=module_map) 2025-05-07T20:31:46.8137399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.8137499Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.8137574Z E ^ 2025-05-07T20:31:46.8137926Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.8137931Z 2025-05-07T20:31:46.8138338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.8138343Z 2025-05-07T20:31:46.8138446Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8138666Z self=, 2025-05-07T20:31:46.8138744Z T=128, 2025-05-07T20:31:46.8138817Z D=7168, 2025-05-07T20:31:46.8138898Z scale_ub=1200.0, 2025-05-07T20:31:46.8138977Z contiguous=True, 2025-05-07T20:31:46.8139062Z compiled=False, 2025-05-07T20:31:46.8139130Z ) 2025-05-07T20:31:46.8139341Z self = 2025-05-07T20:31:46.8139511Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:46.8139516Z 2025-05-07T20:31:46.8139586Z @given( 2025-05-07T20:31:46.8139700Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8139800Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8140001Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8140117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8140231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8140304Z ) 2025-05-07T20:31:46.8140546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8140724Z def test_silu_mul_quant( 2025-05-07T20:31:46.8140797Z self, 2025-05-07T20:31:46.8140872Z T: int, 2025-05-07T20:31:46.8141018Z D: int, 2025-05-07T20:31:46.8141117Z scale_ub: Optional[float], 2025-05-07T20:31:46.8141206Z contiguous: bool, 2025-05-07T20:31:46.8141287Z compiled: bool, 2025-05-07T20:31:46.8141359Z ) -> None: 2025-05-07T20:31:46.8141456Z torch.manual_seed(2025) 2025-05-07T20:31:46.8141526Z 2025-05-07T20:31:46.8141693Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8141770Z 2025-05-07T20:31:46.8141861Z x_sign = torch.sign(x) 2025-05-07T20:31:46.8141982Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.8143743Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8143754Z 2025-05-07T20:31:46.8143871Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:46.8143876Z 2025-05-07T20:31:46.8143976Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.8144194Z self=, 2025-05-07T20:31:46.8144271Z T=128, 2025-05-07T20:31:46.8144346Z D=5120, 2025-05-07T20:31:46.8144424Z scale_ub=1200.0, 2025-05-07T20:31:46.8144510Z contiguous=True, 2025-05-07T20:31:46.8144589Z compiled=True, 2025-05-07T20:31:46.8144659Z ) 2025-05-07T20:31:46.8144872Z self = 2025-05-07T20:31:46.8145041Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:46.8145046Z 2025-05-07T20:31:46.8145129Z @given( 2025-05-07T20:31:46.8145243Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.8145336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.8145450Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.8145563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.8145673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.8145744Z ) 2025-05-07T20:31:46.8145987Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.8146077Z def test_silu_mul_quant( 2025-05-07T20:31:46.8146152Z self, 2025-05-07T20:31:46.8146222Z T: int, 2025-05-07T20:31:46.8146295Z D: int, 2025-05-07T20:31:46.8146393Z scale_ub: Optional[float], 2025-05-07T20:31:46.8146478Z contiguous: bool, 2025-05-07T20:31:46.8146566Z compiled: bool, 2025-05-07T20:31:46.8146638Z ) -> None: 2025-05-07T20:31:46.8146733Z torch.manual_seed(2025) 2025-05-07T20:31:46.8146808Z 2025-05-07T20:31:46.8146969Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.8147037Z 2025-05-07T20:31:46.8147128Z > x_sign = torch.sign(x) 2025-05-07T20:31:46.8148879Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:46.8148971Z
2025-05-07T20:31:46.8149092Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:31:46.8149096Z
2025-05-07T20:31:46.8149269Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:46.8149489Z self=,
2025-05-07T20:31:46.8149559Z T=128,
2025-05-07T20:31:46.8149631Z D=7168,
2025-05-07T20:31:46.8149711Z scale_ub=None,
2025-05-07T20:31:46.8149790Z contiguous=True,
2025-05-07T20:31:46.8149867Z compiled=True,
2025-05-07T20:31:46.8149939Z )
2025-05-07T20:31:46.8150148Z self =
2025-05-07T20:31:46.8150310Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:46.8150315Z
2025-05-07T20:31:46.8150388Z @given(
2025-05-07T20:31:46.8150500Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:46.8150595Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:46.8150705Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:46.8150824Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:46.8150943Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:46.8151011Z )
2025-05-07T20:31:46.8151256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:46.8151349Z def test_silu_mul_quant(
2025-05-07T20:31:46.8151424Z self,
2025-05-07T20:31:46.8151494Z T: int,
2025-05-07T20:31:46.8151571Z D: int,
2025-05-07T20:31:46.8151667Z scale_ub: Optional[float],
2025-05-07T20:31:46.8151752Z contiguous: bool,
2025-05-07T20:31:46.8151835Z compiled: bool,
2025-05-07T20:31:46.8151911Z ) -> None:
2025-05-07T20:31:46.8152001Z torch.manual_seed(2025)
2025-05-07T20:31:46.8152075Z
2025-05-07T20:31:46.8152241Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:46.8153995Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:46.8154005Z
2025-05-07T20:31:46.8154119Z moe/activation_test.py:92: OutOfMemoryError
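The OutOfMemoryError examples above are a knock-on effect rather than independent bugs: earlier examples allocate activations as large as 16384 x 14336 in bfloat16 (roughly 0.44 GiB per tensor before intermediates), and by the later T=128 cases only a few MiB of the A10G's 22.07 GiB remain. A minimal mitigation sketch follows, assuming the test module may be edited locally; release_gpu_memory is a hypothetical helper, not part of activation_test.py, and the allocator hint quoted in the error text only takes effect if set before CUDA is first initialized.

```python
# Hypothetical OOM mitigation sketch; not part of activation_test.py.
import gc
import os

# Allocator hint quoted in the OutOfMemoryError text above; it must be set
# before torch initializes CUDA, so set it before `import torch` runs.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_gpu_memory() -> None:
    """Drop Python references and cached allocator blocks between
    hypothesis examples, so one 16384 x 14336 bf16 example does not
    starve the next one."""
    gc.collect()
    torch.cuda.empty_cache()
```

Calling release_gpu_memory() at the top of test_silu_mul_quant (or from a setUp hook) trades some allocator churn for headroom; it would not touch the CompilationError cases, which fail before any large allocation.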
See " 2025-05-07T20:31:46.8156261Z 2025-05-07T20:31:46.8156437Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:46.8157700Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:46.8157991Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:46.8158067Z 2025-05-07T20:31:46.8158278Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:46.8158443Z ================== 1 failed, 1 passed, 13 warnings in 29.73s =================== 2025-05-07T20:31:48.5859610Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:48.6487870Z 2025-05-07T20:31:48.6488920Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:48.6489303Z 2025-05-07T20:31:48.6489309Z 2025-05-07T20:31:48.6510143Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:50.7966320Z ============================= test session starts ============================== 2025-05-07T20:31:50.7967326Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:50.7967933Z cachedir: .pytest_cache 2025-05-07T20:31:50.7968515Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:50.7969240Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:50.7969652Z plugins: hypothesis-6.131.14 2025-05-07T20:31:52.4050848Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:52.5818396Z collecting ... 
2025-05-07T20:31:48.5859610Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:31:48.6487870Z
2025-05-07T20:31:48.6488920Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py
2025-05-07T20:31:48.6489303Z
2025-05-07T20:31:48.6489309Z
2025-05-07T20:31:48.6510143Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
2025-05-07T20:31:50.7966320Z ============================= test session starts ==============================
2025-05-07T20:31:50.7967326Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:50.7967933Z cachedir: .pytest_cache
2025-05-07T20:31:50.7968515Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:50.7969240Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:50.7969652Z plugins: hypothesis-6.131.14
2025-05-07T20:31:52.4050848Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:52.5818396Z collecting ... collected 2 items / 1 deselected / 1 selected
2025-05-07T20:31:52.5819205Z run-last-failure: rerun previous 1 failure
2025-05-07T20:31:52.5819647Z
2025-05-07T20:31:54.7049004Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:54.7050148Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last):
2025-05-07T20:31:54.7051480Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:54.7053008Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:54.7054380Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:54.7055776Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:54.7057083Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:54.7058444Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:54.7059927Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:54.7061577Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse())
2025-05-07T20:31:54.7062947Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:54.7064174Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node)
2025-05-07T20:31:54.7065211Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:31:54.7066220Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node)
2025-05-07T20:31:54.7067439Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:54.7068734Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:54.7069844Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:31:54.7070870Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item)
2025-05-07T20:31:54.7072039Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:54.7073394Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:54.7074455Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:54.7075369Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:54.7076099Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^
2025-05-07T20:31:54.7077106Z W0507 20:31:54.703000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2924629Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.2925310Z self=, 2025-05-07T20:31:55.2926068Z T=1, 2025-05-07T20:31:55.2926260Z D=5120, 2025-05-07T20:31:55.2926600Z scale_ub=None, 2025-05-07T20:31:55.2926814Z contiguous=True, 2025-05-07T20:31:55.2927040Z compiled=True, 2025-05-07T20:31:55.2927252Z ) 2025-05-07T20:31:55.2927569Z self = 2025-05-07T20:31:55.2928063Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:55.2928321Z 2025-05-07T20:31:55.2928405Z @given( 2025-05-07T20:31:55.2928640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.2928957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.2929265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.2929599Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.2929921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.2930210Z ) 2025-05-07T20:31:55.2930572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.2931004Z def test_silu_mul_quant( 2025-05-07T20:31:55.2931255Z self, 2025-05-07T20:31:55.2931454Z T: int, 2025-05-07T20:31:55.2931650Z D: int, 2025-05-07T20:31:55.2931875Z scale_ub: Optional[float], 2025-05-07T20:31:55.2932152Z contiguous: bool, 2025-05-07T20:31:55.2932387Z compiled: bool, 2025-05-07T20:31:55.2932617Z ) -> None: 2025-05-07T20:31:55.2932840Z torch.manual_seed(2025) 2025-05-07T20:31:55.2933076Z 2025-05-07T20:31:55.2933355Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.2933699Z 2025-05-07T20:31:55.2933894Z x_sign = torch.sign(x) 2025-05-07T20:31:55.2934183Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.2934495Z x = x_sign * x_clamp 2025-05-07T20:31:55.2934740Z x0 = x[:, :D] 2025-05-07T20:31:55.2934951Z x1 = x[:, D:] 2025-05-07T20:31:55.2935167Z 2025-05-07T20:31:55.2935357Z if contiguous: 2025-05-07T20:31:55.2935586Z x0 = x0.contiguous() 2025-05-07T20:31:55.2935849Z x1 = x1.contiguous() 2025-05-07T20:31:55.2936092Z 2025-05-07T20:31:55.2936284Z if scale_ub is not None: 2025-05-07T20:31:55.2936565Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.2936906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.2937207Z ) 2025-05-07T20:31:55.2937405Z else: 2025-05-07T20:31:55.2937627Z scale_ub_tensor = None 2025-05-07T20:31:55.2937876Z 2025-05-07T20:31:55.2938113Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2938426Z op = silu_mul_quant 2025-05-07T20:31:55.2938682Z if compiled: 2025-05-07T20:31:55.2938929Z op = torch.compile(op) 2025-05-07T20:31:55.2939230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.2939515Z 2025-05-07T20:31:55.2939705Z y_fp8, y_scale = fn() 2025-05-07T20:31:55.2940089Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:55.2940398Z 2025-05-07T20:31:55.2940635Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2940974Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:55.2941275Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:55.2941586Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:55.2941944Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2942259Z 2025-05-07T20:31:55.2942457Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:55.2942660Z 2025-05-07T20:31:55.2942764Z moe/activation_test.py:126: 2025-05-07T20:31:55.2943062Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2943396Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:55.2943719Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2944738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:55.2945483Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:55.2946027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.2946699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.2947380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:55.2948101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.2948842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:55.2949584Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.2950315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:55.2950951Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:55.2951541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:55.2952062Z fn() 2025-05-07T20:31:55.2952568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:55.2953150Z self.fn.run( 2025-05-07T20:31:55.2953610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.2954137Z kernel = self.compile( 2025-05-07T20:31:55.2954676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.2955325Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.2955724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2955955Z 2025-05-07T20:31:55.2956161Z self = 2025-05-07T20:31:55.2957235Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.2958640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07cfc6f400>} 2025-05-07T20:31:55.2959988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.2970086Z context = 2025-05-07T20:31:55.2970538Z 2025-05-07T20:31:55.2970730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.2971274Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.2971756Z module_map=module_map) 2025-05-07T20:31:55.2972139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.2972511Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:55.2972797Z E ^ 2025-05-07T20:31:55.2973275Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2973734Z 2025-05-07T20:31:55.2974166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.2974689Z 2025-05-07T20:31:55.2974932Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.2975363Z self=, 2025-05-07T20:31:55.2975852Z T=2048, 2025-05-07T20:31:55.2976060Z D=5120, 2025-05-07T20:31:55.2976267Z scale_ub=1200.0, 2025-05-07T20:31:55.2976498Z contiguous=True, 2025-05-07T20:31:55.2976735Z compiled=False, 2025-05-07T20:31:55.2976961Z ) 2025-05-07T20:31:56.2176260Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.2177354Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:56.2178688Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.2180237Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.2181634Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.2183017Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.2184315Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:56.2185693Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.2187102Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:56.2188334Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:56.2189562Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:56.2190985Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:56.2192046Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:56.2193059Z W0507 20:31:56.213000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:56.2194279Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:56.2195563Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:56.2196680Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:56.2198245Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:56.2199421Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:56.2200779Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:56.2201843Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.2202760Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.2203516Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:56.2204534Z W0507 20:31:56.213000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
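Since the 'ci' profile in the session header above runs with derandomize=True and print_blob=True, every "Trying example" here is deterministic. To debug a single case locally, the parameters of the example reported next (T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False) could be pinned with hypothesis's @example decorator; a sketch of such a hypothetical local edit, with the test body left unchanged:

```python
# Hypothetical local edit to ActivationTests.test_silu_mul_quant for
# isolating one failing case; mirrors parameters printed in this log.
from typing import Optional
from hypothesis import example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
@settings(max_examples=1, deadline=None)
def test_silu_mul_quant(
    self,
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    ...  # body as in activation_test.py; explicit examples always run
```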
2025-05-07T20:31:57.1806401Z self =
2025-05-07T20:31:57.1807112Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:57.1807397Z
2025-05-07T20:31:57.1807481Z @given(
2025-05-07T20:31:57.1807725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:57.1808042Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:57.1808344Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:57.1808677Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:57.1809008Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:57.1809291Z )
2025-05-07T20:31:57.1809686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:57.1810139Z def test_silu_mul_quant(
2025-05-07T20:31:57.1810408Z self,
2025-05-07T20:31:57.1810619Z T: int,
2025-05-07T20:31:57.1810831Z D: int,
2025-05-07T20:31:57.1811068Z scale_ub: Optional[float],
2025-05-07T20:31:57.1811367Z contiguous: bool,
2025-05-07T20:31:57.1811619Z compiled: bool,
2025-05-07T20:31:57.1811857Z ) -> None:
2025-05-07T20:31:57.1812074Z torch.manual_seed(2025)
2025-05-07T20:31:57.1812323Z
2025-05-07T20:31:57.1812602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:57.1812942Z
2025-05-07T20:31:57.1813144Z x_sign = torch.sign(x)
2025-05-07T20:31:57.1813439Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:57.1813747Z x = x_sign * x_clamp
2025-05-07T20:31:57.1813994Z x0 = x[:, :D]
2025-05-07T20:31:57.1814215Z x1 = x[:, D:]
2025-05-07T20:31:57.1814820Z
2025-05-07T20:31:57.1815016Z if contiguous:
2025-05-07T20:31:57.1815263Z x0 = x0.contiguous()
2025-05-07T20:31:57.1815651Z x1 = x1.contiguous()
2025-05-07T20:31:57.1815905Z
2025-05-07T20:31:57.1816107Z if scale_ub is not None:
2025-05-07T20:31:57.1816381Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:57.1816723Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:57.1817038Z )
2025-05-07T20:31:57.1817240Z else:
2025-05-07T20:31:57.1817454Z scale_ub_tensor = None
2025-05-07T20:31:57.1817707Z
2025-05-07T20:31:57.1817944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:57.1818253Z op = silu_mul_quant
2025-05-07T20:31:57.1818507Z if compiled:
2025-05-07T20:31:57.1818758Z op = torch.compile(op)
2025-05-07T20:31:57.1819053Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.1819340Z
2025-05-07T20:31:57.1819540Z > y_fp8, y_scale = fn()
2025-05-07T20:31:57.1819707Z
2025-05-07T20:31:57.1819917Z moe/activation_test.py:117:
2025-05-07T20:31:57.1820224Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1820558Z moe/activation_test.py:115: in fn
2025-05-07T20:31:57.1820848Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.1821535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.1822225Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.1822764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:57.1823438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.1824101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.1824645Z kernel = self.compile(
2025-05-07T20:31:57.1825191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.1825838Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.1826237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1826463Z
2025-05-07T20:31:57.1826678Z self =
2025-05-07T20:31:57.1827753Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.1829120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07cf43eef0>}
2025-05-07T20:31:57.1830474Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:57.1831491Z context =
2025-05-07T20:31:57.1831780Z
2025-05-07T20:31:57.1831953Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:57.1832470Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:57.1832946Z module_map=module_map)
2025-05-07T20:31:57.1833317Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.1833674Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.1833939Z E ^
2025-05-07T20:31:57.1834414Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.1834947Z
2025-05-07T20:31:57.1835439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.1835947Z
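The example that follows fails in the reference path instead: ref_fn calls triton_quantize_fp8_row, whose _kernel_quantize_fp8_row kernel trips the same fp8e4nv limitation during autotuning. For intuition, here is a pure-PyTorch sketch of rowwise FP8 quantization consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]); the e4m3 maximum of 448 and the scale_ub clamping semantics are assumptions, not read from this log, and this is not FBGEMM's implementation.

```python
# Hedged reference sketch of rowwise FP8 quantization (assumed semantics
# of triton_quantize_fp8_row; not FBGEMM's actual kernel logic).
from typing import Optional, Tuple
import torch

FP8_E4M3_MAX = 448.0  # assumed finite max of torch.float8_e4m3fn

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # One scale per row, so each row maps into the representable range.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        # Assumption: scale_ub caps the per-row max before scaling.
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # dequantize: y_fp8.float() * scale[:, None]
```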
2025-05-07T20:31:57.1836054Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:57.1836467Z self=,
2025-05-07T20:31:57.1836871Z T=2048,
2025-05-07T20:31:57.1837065Z D=5120,
2025-05-07T20:31:57.1837259Z scale_ub=1200.0,
2025-05-07T20:31:57.1837486Z contiguous=True,
2025-05-07T20:31:57.1837715Z compiled=True,
2025-05-07T20:31:57.1837922Z )
2025-05-07T20:31:57.1838247Z self =
2025-05-07T20:31:57.1838741Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:57.1839008Z
2025-05-07T20:31:57.1839086Z @given(
2025-05-07T20:31:57.1839324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:57.1839683Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:57.1839994Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:57.1840330Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:57.1840662Z compiled=st.sampled_from([True, False]),
2025-05-07T20:31:57.1840951Z )
2025-05-07T20:31:57.1841296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:57.1841740Z def test_silu_mul_quant(
2025-05-07T20:31:57.1841985Z self,
2025-05-07T20:31:57.1842175Z T: int,
2025-05-07T20:31:57.1842373Z D: int,
2025-05-07T20:31:57.1842598Z scale_ub: Optional[float],
2025-05-07T20:31:57.1842866Z contiguous: bool,
2025-05-07T20:31:57.1843107Z compiled: bool,
2025-05-07T20:31:57.1843339Z ) -> None:
2025-05-07T20:31:57.1843553Z torch.manual_seed(2025)
2025-05-07T20:31:57.1843798Z
2025-05-07T20:31:57.1844079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:57.1844422Z
2025-05-07T20:31:57.1844628Z x_sign = torch.sign(x)
2025-05-07T20:31:57.1844930Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:57.1845246Z x = x_sign * x_clamp
2025-05-07T20:31:57.1845490Z x0 = x[:, :D]
2025-05-07T20:31:57.1845715Z x1 = x[:, D:]
2025-05-07T20:31:57.1845932Z
2025-05-07T20:31:57.1846122Z if contiguous:
2025-05-07T20:31:57.1846364Z x0 = x0.contiguous()
2025-05-07T20:31:57.1846631Z x1 = x1.contiguous()
2025-05-07T20:31:57.1846871Z
2025-05-07T20:31:57.1847071Z if scale_ub is not None:
2025-05-07T20:31:57.1847352Z scale_ub_tensor = torch.tensor(
2025-05-07T20:31:57.1847688Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:57.1848011Z )
2025-05-07T20:31:57.1848215Z else:
2025-05-07T20:31:57.1848428Z scale_ub_tensor = None
2025-05-07T20:31:57.1848688Z
2025-05-07T20:31:57.1848938Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:57.1849256Z op = silu_mul_quant
2025-05-07T20:31:57.1849542Z if compiled:
2025-05-07T20:31:57.1849844Z op = torch.compile(op)
2025-05-07T20:31:57.1850148Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:57.1850428Z
2025-05-07T20:31:57.1850634Z y_fp8, y_scale = fn()
2025-05-07T20:31:57.1850932Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:57.1851225Z
2025-05-07T20:31:57.1851471Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:57.1851813Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:57.1852110Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:57.1852439Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:57.1852807Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:57.1853116Z
2025-05-07T20:31:57.1853327Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:57.1853651Z
2025-05-07T20:31:57.1853759Z moe/activation_test.py:126:
2025-05-07T20:31:57.1854136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1854477Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:57.1854809Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:57.1855598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:57.1856348Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:57.1856902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:31:57.1857589Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:57.1858278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:57.1858999Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:57.1859763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in
2025-05-07T20:31:57.1860588Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:57.1861316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:57.1861946Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:57.1862545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:57.1863064Z fn()
2025-05-07T20:31:57.1863573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:57.1864160Z self.fn.run(
2025-05-07T20:31:57.1864632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:57.1865166Z kernel = self.compile(
2025-05-07T20:31:57.1865703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:57.1866355Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:57.1866756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:57.1866982Z
2025-05-07T20:31:57.1867196Z self =
2025-05-07T20:31:57.1868268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:57.1869641Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bdf05ab0>} 2025-05-07T20:31:57.1870983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.1872002Z context = 2025-05-07T20:31:57.1872289Z 2025-05-07T20:31:57.1872455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.1872975Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.1873443Z module_map=module_map) 2025-05-07T20:31:57.1873814Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.1874166Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:57.1874440Z E ^ 2025-05-07T20:31:57.1874906Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.1875439Z 2025-05-07T20:31:57.1875924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.1876450Z 2025-05-07T20:31:57.1876555Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.1876972Z self=, 2025-05-07T20:31:57.1877374Z T=16384, 2025-05-07T20:31:57.1877569Z D=7168, 2025-05-07T20:31:57.1877774Z scale_ub=1200.0, 2025-05-07T20:31:57.1878006Z contiguous=False, 2025-05-07T20:31:57.1878232Z compiled=False, 2025-05-07T20:31:57.1878448Z ) 2025-05-07T20:31:57.7398775Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:57.7399995Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:57.7401710Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:57.7403510Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:57.7404869Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:57.7406244Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.7407551Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:57.7408916Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.7410315Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:57.7411540Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 
2025-05-07T20:31:57.7412752Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.7413957Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:57.7415149Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:57.7416158Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:57.7417355Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.7419086Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.7420294Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:57.7421324Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:57.7422492Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.7423840Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.7424904Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.7425806Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.7426542Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:57.7427547Z W0507 20:31:57.736000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.9501831Z self = 2025-05-07T20:31:58.9502466Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:58.9502881Z 2025-05-07T20:31:58.9503009Z @given( 2025-05-07T20:31:58.9503330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.9503803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.9504226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.9504679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.9505102Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.9505405Z ) 2025-05-07T20:31:58.9505759Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.9506194Z def test_silu_mul_quant( 2025-05-07T20:31:58.9506445Z self, 2025-05-07T20:31:58.9506645Z T: int, 2025-05-07T20:31:58.9506839Z D: int, 2025-05-07T20:31:58.9507063Z scale_ub: Optional[float], 2025-05-07T20:31:58.9507336Z contiguous: bool, 2025-05-07T20:31:58.9507575Z compiled: bool, 2025-05-07T20:31:58.9507803Z ) -> None: 2025-05-07T20:31:58.9508023Z torch.manual_seed(2025) 2025-05-07T20:31:58.9508263Z 2025-05-07T20:31:58.9508543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.9509247Z 2025-05-07T20:31:58.9509439Z x_sign = torch.sign(x) 2025-05-07T20:31:58.9509958Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.9510280Z x = x_sign * x_clamp 2025-05-07T20:31:58.9510524Z x0 = x[:, :D] 2025-05-07T20:31:58.9510740Z x1 = x[:, D:] 2025-05-07T20:31:58.9510956Z 2025-05-07T20:31:58.9511146Z if contiguous: 2025-05-07T20:31:58.9511378Z x0 = x0.contiguous() 2025-05-07T20:31:58.9511639Z x1 = x1.contiguous() 2025-05-07T20:31:58.9511881Z 2025-05-07T20:31:58.9512072Z if scale_ub is not None: 2025-05-07T20:31:58.9512349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.9512690Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.9512992Z ) 2025-05-07T20:31:58.9513188Z else: 2025-05-07T20:31:58.9513402Z scale_ub_tensor = None
2025-05-07T20:31:58.9513650Z 2025-05-07T20:31:58.9513893Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9514207Z op = silu_mul_quant 2025-05-07T20:31:58.9514462Z if compiled: 2025-05-07T20:31:58.9514714Z op = torch.compile(op) 2025-05-07T20:31:58.9515013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9515288Z 2025-05-07T20:31:58.9515476Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.9515646Z 2025-05-07T20:31:58.9515749Z moe/activation_test.py:117: 2025-05-07T20:31:58.9516047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9516375Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.9516660Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9517352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.9518036Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.9518581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.9519264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.9519928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.9520504Z kernel = self.compile( 2025-05-07T20:31:58.9521045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.9521720Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.9522107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9522340Z 2025-05-07T20:31:58.9532886Z self = 2025-05-07T20:31:58.9534021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.9535447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bdf05870>} 2025-05-07T20:31:58.9536794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.9537822Z context = 2025-05-07T20:31:58.9538126Z 2025-05-07T20:31:58.9538304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.9538838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.9539343Z module_map=module_map) 2025-05-07T20:31:58.9540017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.9540388Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.9540744Z E ^ 2025-05-07T20:31:58.9541235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.9541689Z 2025-05-07T20:31:58.9542116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.9542633Z 2025-05-07T20:31:58.9542746Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.9543183Z self=, 2025-05-07T20:31:58.9543595Z T=1, 2025-05-07T20:31:58.9543789Z D=7168, 2025-05-07T20:31:58.9544001Z scale_ub=None, 2025-05-07T20:31:58.9544232Z contiguous=True, 2025-05-07T20:31:58.9544463Z compiled=True, 2025-05-07T20:31:58.9544685Z ) 2025-05-07T20:31:58.9545018Z self = 2025-05-07T20:31:58.9545525Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.9545789Z 2025-05-07T20:31:58.9545873Z @given( 2025-05-07T20:31:58.9546119Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.9546444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.9546756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.9547094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.9547435Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.9547734Z ) 2025-05-07T20:31:58.9548097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.9548553Z def test_silu_mul_quant( 2025-05-07T20:31:58.9548803Z self, 2025-05-07T20:31:58.9549013Z T: int, 2025-05-07T20:31:58.9549225Z D: int, 2025-05-07T20:31:58.9549459Z scale_ub: Optional[float], 2025-05-07T20:31:58.9549739Z contiguous: bool, 2025-05-07T20:31:58.9550001Z compiled: bool, 2025-05-07T20:31:58.9550286Z ) -> None: 2025-05-07T20:31:58.9550509Z torch.manual_seed(2025) 2025-05-07T20:31:58.9550764Z 2025-05-07T20:31:58.9551050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.9551395Z 2025-05-07T20:31:58.9551603Z x_sign = torch.sign(x) 2025-05-07T20:31:58.9551905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.9552221Z x = x_sign * x_clamp 2025-05-07T20:31:58.9552477Z x0 = x[:, :D] 2025-05-07T20:31:58.9552711Z x1 = x[:, D:] 2025-05-07T20:31:58.9552927Z 2025-05-07T20:31:58.9553128Z if contiguous: 2025-05-07T20:31:58.9553377Z x0 = x0.contiguous() 2025-05-07T20:31:58.9553645Z x1 = x1.contiguous() 2025-05-07T20:31:58.9553898Z 2025-05-07T20:31:58.9554109Z if scale_ub is not None: 2025-05-07T20:31:58.9554399Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.9554754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.9555082Z ) 2025-05-07T20:31:58.9555292Z else: 2025-05-07T20:31:58.9555511Z scale_ub_tensor = None 2025-05-07T20:31:58.9555779Z 2025-05-07T20:31:58.9556030Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9556355Z op = silu_mul_quant 2025-05-07T20:31:58.9556621Z if compiled: 2025-05-07T20:31:58.9556884Z op = torch.compile(op) 2025-05-07T20:31:58.9557189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.9557476Z 2025-05-07T20:31:58.9557688Z y_fp8, y_scale = fn() 2025-05-07T20:31:58.9557980Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:58.9558285Z 2025-05-07T20:31:58.9558536Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.9558940Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:58.9559391Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:58.9559789Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:58.9560160Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.9560479Z 2025-05-07T20:31:58.9560683Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:58.9560884Z 2025-05-07T20:31:58.9560988Z moe/activation_test.py:126: 2025-05-07T20:31:58.9561293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9561637Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.9561962Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.9562760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:58.9563543Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.9564108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.9564831Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.9565548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.9566302Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.9567082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.9567861Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.9568622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.9569295Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.9569917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.9570462Z fn() 2025-05-07T20:31:58.9570999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.9571600Z self.fn.run( 2025-05-07T20:31:58.9572086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.9572639Z kernel = self.compile( 2025-05-07T20:31:58.9573205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.9573883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.9574299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.9574536Z 2025-05-07T20:31:58.9574761Z self = 2025-05-07T20:31:58.9575901Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.9577330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bde8d870>} 2025-05-07T20:31:58.9578728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.9579869Z context = 2025-05-07T20:31:58.9580201Z 2025-05-07T20:31:58.9580404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.9580944Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.9581519Z module_map=module_map) 2025-05-07T20:31:58.9581974Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.9582351Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.9582624Z E ^ 2025-05-07T20:31:58.9583109Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.9583575Z 2025-05-07T20:31:58.9584016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.9584552Z 2025-05-07T20:31:58.9584669Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.9585096Z self=, 2025-05-07T20:31:58.9585520Z T=4096, 2025-05-07T20:31:58.9585719Z D=5120, 2025-05-07T20:31:58.9585917Z scale_ub=None, 2025-05-07T20:31:58.9586145Z contiguous=False, 2025-05-07T20:31:58.9586390Z compiled=False, 2025-05-07T20:31:58.9586601Z ) 2025-05-07T20:31:59.5465970Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.5467054Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:31:59.5468395Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.5469823Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.5471195Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.5472580Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.5473878Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.5475255Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.5476666Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.5477909Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 
2025-05-07T20:31:59.5479122Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.5480350Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:31:59.5481406Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:59.5482420Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:31:59.5484124Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.5485397Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.5486503Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:59.5487546Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:31:59.5488715Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.5490409Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.5491478Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.5492376Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.5493106Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:31:59.5494104Z W0507 20:31:59.543000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3147071Z self = 2025-05-07T20:32:01.3147879Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3148278Z 2025-05-07T20:32:01.3148389Z @given( 2025-05-07T20:32:01.3148717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3149129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3149537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3149966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3150297Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3150589Z ) 2025-05-07T20:32:01.3151005Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3151454Z def test_silu_mul_quant( 2025-05-07T20:32:01.3151694Z self, 2025-05-07T20:32:01.3151894Z T: int, 2025-05-07T20:32:01.3152092Z D: int, 2025-05-07T20:32:01.3152306Z scale_ub: Optional[float], 2025-05-07T20:32:01.3152960Z contiguous: bool, 2025-05-07T20:32:01.3153205Z compiled: bool, 2025-05-07T20:32:01.3153571Z ) -> None: 2025-05-07T20:32:01.3153795Z torch.manual_seed(2025) 2025-05-07T20:32:01.3154040Z 2025-05-07T20:32:01.3154311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3154661Z 2025-05-07T20:32:01.3154860Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3155147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3155460Z x = x_sign * x_clamp 2025-05-07T20:32:01.3155709Z x0 = x[:, :D] 2025-05-07T20:32:01.3155921Z x1 = x[:, D:] 2025-05-07T20:32:01.3156133Z 2025-05-07T20:32:01.3156326Z if contiguous: 2025-05-07T20:32:01.3156561Z x0 = x0.contiguous() 2025-05-07T20:32:01.3156819Z x1 = x1.contiguous() 2025-05-07T20:32:01.3157065Z 2025-05-07T20:32:01.3157263Z if scale_ub is not None: 2025-05-07T20:32:01.3157541Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3157881Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3158201Z ) 2025-05-07T20:32:01.3158392Z else: 2025-05-07T20:32:01.3158611Z scale_ub_tensor = None
2025-05-07T20:32:01.3158861Z 2025-05-07T20:32:01.3159094Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3159411Z op = silu_mul_quant 2025-05-07T20:32:01.3159666Z if compiled: 2025-05-07T20:32:01.3159920Z op = torch.compile(op) 2025-05-07T20:32:01.3160223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3160499Z 2025-05-07T20:32:01.3160693Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3160865Z 2025-05-07T20:32:01.3160968Z moe/activation_test.py:117: 2025-05-07T20:32:01.3161267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3161604Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3161893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3162590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3163289Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3163825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3164510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3165175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3165720Z kernel = self.compile( 2025-05-07T20:32:01.3166261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3166921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3167325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3167550Z 2025-05-07T20:32:01.3167771Z self = 2025-05-07T20:32:01.3168850Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3170235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bde8eb90>} 2025-05-07T20:32:01.3171575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3172597Z context = 2025-05-07T20:32:01.3172975Z 2025-05-07T20:32:01.3173144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3173737Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3174214Z module_map=module_map) 2025-05-07T20:32:01.3174583Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3174934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3175190Z E ^ 2025-05-07T20:32:01.3175657Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3176101Z 2025-05-07T20:32:01.3176518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3177032Z 2025-05-07T20:32:01.3177135Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3177548Z self=, 2025-05-07T20:32:01.3177952Z T=4096, 2025-05-07T20:32:01.3178137Z D=7168, 2025-05-07T20:32:01.3178338Z scale_ub=None, 2025-05-07T20:32:01.3178568Z contiguous=False, 2025-05-07T20:32:01.3178792Z compiled=False, 2025-05-07T20:32:01.3179001Z ) 2025-05-07T20:32:01.3179322Z self = 2025-05-07T20:32:01.3179988Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3180286Z 2025-05-07T20:32:01.3180364Z @given( 2025-05-07T20:32:01.3180601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3180909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3181223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3181560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3181892Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3182171Z ) 2025-05-07T20:32:01.3182536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3182986Z def test_silu_mul_quant( 2025-05-07T20:32:01.3183227Z self, 2025-05-07T20:32:01.3183430Z T: int, 2025-05-07T20:32:01.3183635Z D: int, 2025-05-07T20:32:01.3183855Z scale_ub: Optional[float], 2025-05-07T20:32:01.3184135Z contiguous: bool, 2025-05-07T20:32:01.3184382Z compiled: bool, 2025-05-07T20:32:01.3184603Z ) -> None: 2025-05-07T20:32:01.3184827Z torch.manual_seed(2025) 2025-05-07T20:32:01.3185077Z 2025-05-07T20:32:01.3185350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3185699Z 2025-05-07T20:32:01.3185896Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3186182Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3186493Z x = x_sign * x_clamp 2025-05-07T20:32:01.3186737Z x0 = x[:, :D] 2025-05-07T20:32:01.3186965Z x1 = x[:, D:] 2025-05-07T20:32:01.3187168Z 2025-05-07T20:32:01.3187366Z if contiguous: 2025-05-07T20:32:01.3187606Z x0 = x0.contiguous() 2025-05-07T20:32:01.3187862Z x1 = x1.contiguous() 2025-05-07T20:32:01.3188106Z 2025-05-07T20:32:01.3188306Z if scale_ub is not None: 2025-05-07T20:32:01.3188576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3188917Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3189234Z ) 2025-05-07T20:32:01.3189421Z else: 2025-05-07T20:32:01.3189638Z scale_ub_tensor = None 2025-05-07T20:32:01.3190304Z 2025-05-07T20:32:01.3190542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3190859Z op = silu_mul_quant 2025-05-07T20:32:01.3191113Z if compiled: 2025-05-07T20:32:01.3191359Z op = torch.compile(op) 2025-05-07T20:32:01.3191658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3192168Z 2025-05-07T20:32:01.3192363Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3192529Z 2025-05-07T20:32:01.3192738Z moe/activation_test.py:117: 2025-05-07T20:32:01.3193041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3193375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3193659Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3194350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3195044Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3195586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3196262Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3196930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3197467Z kernel = self.compile( 2025-05-07T20:32:01.3198008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3198673Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3199082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3199308Z 2025-05-07T20:32:01.3199528Z self = 2025-05-07T20:32:01.3200600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3201986Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bde8eb00>} 2025-05-07T20:32:01.3203338Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3204360Z context = 2025-05-07T20:32:01.3204646Z 2025-05-07T20:32:01.3204819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3205332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3205798Z module_map=module_map) 2025-05-07T20:32:01.3206166Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3206515Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3206771Z E ^ 2025-05-07T20:32:01.3207235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3207692Z 2025-05-07T20:32:01.3208119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3208626Z 2025-05-07T20:32:01.3208732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3209145Z self=, 2025-05-07T20:32:01.3209542Z T=128, 2025-05-07T20:32:01.3209726Z D=7168, 2025-05-07T20:32:01.3209922Z scale_ub=None, 2025-05-07T20:32:01.3210144Z contiguous=False, 2025-05-07T20:32:01.3210370Z compiled=True, 2025-05-07T20:32:01.3210578Z ) 2025-05-07T20:32:01.3848055Z self = 2025-05-07T20:32:01.3848787Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.3849169Z 2025-05-07T20:32:01.3849298Z @given( 2025-05-07T20:32:01.3849553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3850228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3850538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3850997Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3851337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3851631Z ) 2025-05-07T20:32:01.3851981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3852427Z def test_silu_mul_quant( 2025-05-07T20:32:01.3852677Z self, 2025-05-07T20:32:01.3852873Z T: int, 2025-05-07T20:32:01.3853077Z D: int, 2025-05-07T20:32:01.3853302Z scale_ub: Optional[float], 2025-05-07T20:32:01.3853580Z contiguous: bool, 2025-05-07T20:32:01.3853826Z compiled: bool, 2025-05-07T20:32:01.3854061Z ) -> None: 2025-05-07T20:32:01.3854286Z torch.manual_seed(2025) 2025-05-07T20:32:01.3854528Z 2025-05-07T20:32:01.3854807Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3855166Z 2025-05-07T20:32:01.3855362Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3855665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3855981Z x = x_sign * x_clamp 2025-05-07T20:32:01.3856222Z x0 = x[:, :D] 2025-05-07T20:32:01.3856444Z x1 = x[:, D:] 2025-05-07T20:32:01.3856665Z 2025-05-07T20:32:01.3856851Z if contiguous: 2025-05-07T20:32:01.3857094Z x0 = x0.contiguous() 2025-05-07T20:32:01.3857359Z x1 = x1.contiguous() 2025-05-07T20:32:01.3857597Z 2025-05-07T20:32:01.3857800Z if scale_ub is not None: 2025-05-07T20:32:01.3858077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3858436Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3866127Z ) 2025-05-07T20:32:01.3866348Z else: 2025-05-07T20:32:01.3866579Z scale_ub_tensor = None 2025-05-07T20:32:01.3866854Z 2025-05-07T20:32:01.3867113Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3867445Z op = silu_mul_quant 2025-05-07T20:32:01.3867720Z if compiled: 2025-05-07T20:32:01.3867977Z op = torch.compile(op) 2025-05-07T20:32:01.3868287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3868574Z 2025-05-07T20:32:01.3868774Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.3869077Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.3869378Z 2025-05-07T20:32:01.3869616Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3869958Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.3870255Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.3870572Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.3870930Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.3871243Z 2025-05-07T20:32:01.3871455Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:01.3871650Z 2025-05-07T20:32:01.3871753Z moe/activation_test.py:126: 2025-05-07T20:32:01.3872063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3872404Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.3872731Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.3873528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.3874283Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.3874837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3875519Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3876220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.3877077Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.3877909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.3878666Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.3879400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.3880049Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.3880658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.3881228Z fn() 2025-05-07T20:32:01.3881742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.3882341Z self.fn.run( 2025-05-07T20:32:01.3882815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3883357Z kernel = self.compile( 2025-05-07T20:32:01.3883915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3884579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3884975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3885210Z 2025-05-07T20:32:01.3885426Z self = 2025-05-07T20:32:01.3886512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3887900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bde8fac0>} 2025-05-07T20:32:01.3889253Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3890597Z context = 2025-05-07T20:32:01.3890897Z 2025-05-07T20:32:01.3891070Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3891604Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3892073Z module_map=module_map) 2025-05-07T20:32:01.3892451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3892818Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.3893085Z E ^ 2025-05-07T20:32:01.3893570Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3894026Z 2025-05-07T20:32:01.3894458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3894971Z 2025-05-07T20:32:01.3895088Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3895499Z self=, 2025-05-07T20:32:01.3895910Z T=128, 2025-05-07T20:32:01.3896112Z D=7168, 2025-05-07T20:32:01.3896306Z scale_ub=None, 2025-05-07T20:32:01.3896536Z contiguous=False, 2025-05-07T20:32:01.3896775Z compiled=False, 2025-05-07T20:32:01.3896991Z ) 2025-05-07T20:32:01.7577991Z self = 2025-05-07T20:32:01.7578787Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.7579180Z 2025-05-07T20:32:01.7579663Z @given( 2025-05-07T20:32:01.7580180Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.7580689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.7581005Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.7581343Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.7581677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.7581970Z ) 2025-05-07T20:32:01.7582329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.7582773Z def test_silu_mul_quant( 2025-05-07T20:32:01.7583027Z self, 2025-05-07T20:32:01.7583233Z T: int, 2025-05-07T20:32:01.7583434Z D: int, 2025-05-07T20:32:01.7583663Z scale_ub: Optional[float], 2025-05-07T20:32:01.7583946Z contiguous: bool, 2025-05-07T20:32:01.7584186Z compiled: bool, 2025-05-07T20:32:01.7584425Z ) -> None: 2025-05-07T20:32:01.7584650Z torch.manual_seed(2025) 2025-05-07T20:32:01.7584913Z 2025-05-07T20:32:01.7585188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.7585547Z 2025-05-07T20:32:01.7585749Z x_sign = torch.sign(x) 2025-05-07T20:32:01.7586046Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.7586366Z x = x_sign * x_clamp 2025-05-07T20:32:01.7586616Z x0 = x[:, :D] 2025-05-07T20:32:01.7586836Z x1 = x[:, D:] 2025-05-07T20:32:01.7587053Z 2025-05-07T20:32:01.7587248Z if contiguous: 2025-05-07T20:32:01.7587486Z x0 = x0.contiguous() 2025-05-07T20:32:01.7587756Z x1 = x1.contiguous() 2025-05-07T20:32:01.7588001Z 2025-05-07T20:32:01.7588198Z if scale_ub is not None: 2025-05-07T20:32:01.7588485Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.7588833Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.7589145Z ) 2025-05-07T20:32:01.7589361Z else: 2025-05-07T20:32:01.7589592Z scale_ub_tensor = None 2025-05-07T20:32:01.7590094Z 2025-05-07T20:32:01.7590337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.7590662Z op = silu_mul_quant 2025-05-07T20:32:01.7590925Z if compiled: 
2025-05-07T20:32:01.7591182Z op = torch.compile(op) 2025-05-07T20:32:01.7591488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7591776Z 2025-05-07T20:32:01.7591973Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.7592152Z 2025-05-07T20:32:01.7592256Z moe/activation_test.py:117: 2025-05-07T20:32:01.7592565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7592900Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.7593189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7593889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.7594596Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.7595141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.7595839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.7596509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.7597050Z kernel = self.compile( 2025-05-07T20:32:01.7597604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.7598271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.7598675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7598905Z 2025-05-07T20:32:01.7599119Z self = 2025-05-07T20:32:01.7600444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.7601841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bda38f70>} 2025-05-07T20:32:01.7603194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.7604227Z context = 2025-05-07T20:32:01.7604519Z 2025-05-07T20:32:01.7604689Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.7605219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.7605699Z module_map=module_map) 2025-05-07T20:32:01.7606075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.7606436Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.7606707Z E ^ 2025-05-07T20:32:01.7607185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.7607636Z 2025-05-07T20:32:01.7608055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.7608579Z 2025-05-07T20:32:01.7608689Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.7609109Z self=, 2025-05-07T20:32:01.7609517Z T=4096, 2025-05-07T20:32:01.7609713Z D=5120, 2025-05-07T20:32:01.7609916Z scale_ub=1200.0, 2025-05-07T20:32:01.7610154Z contiguous=True, 2025-05-07T20:32:01.7610390Z compiled=False, 2025-05-07T20:32:01.7610612Z ) 2025-05-07T20:32:01.7610948Z self = 2025-05-07T20:32:01.7611453Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.7611734Z 2025-05-07T20:32:01.7611820Z @given( 2025-05-07T20:32:01.7612066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.7612386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.7612708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.7613046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.7613385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.7613674Z ) 2025-05-07T20:32:01.7614034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.7614485Z def test_silu_mul_quant( 2025-05-07T20:32:01.7614733Z self, 2025-05-07T20:32:01.7614942Z T: int, 2025-05-07T20:32:01.7615151Z D: int, 2025-05-07T20:32:01.7615377Z scale_ub: Optional[float], 2025-05-07T20:32:01.7615667Z contiguous: bool, 2025-05-07T20:32:01.7615916Z compiled: bool, 2025-05-07T20:32:01.7616142Z ) -> None: 2025-05-07T20:32:01.7616369Z torch.manual_seed(2025) 2025-05-07T20:32:01.7616621Z 2025-05-07T20:32:01.7616898Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.7617246Z 2025-05-07T20:32:01.7617450Z x_sign = torch.sign(x) 2025-05-07T20:32:01.7617746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.7618067Z x = x_sign * x_clamp 2025-05-07T20:32:01.7618319Z x0 = x[:, :D] 2025-05-07T20:32:01.7618546Z x1 = x[:, D:] 2025-05-07T20:32:01.7618757Z 2025-05-07T20:32:01.7618959Z if contiguous: 2025-05-07T20:32:01.7619206Z x0 = x0.contiguous() 2025-05-07T20:32:01.7619474Z x1 = x1.contiguous() 2025-05-07T20:32:01.7619911Z 2025-05-07T20:32:01.7620116Z if scale_ub is not None: 2025-05-07T20:32:01.7620500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.7620845Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.7621164Z ) 2025-05-07T20:32:01.7621365Z else: 2025-05-07T20:32:01.7621589Z scale_ub_tensor = None 2025-05-07T20:32:01.7621850Z 2025-05-07T20:32:01.7622087Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.7622412Z op = silu_mul_quant 2025-05-07T20:32:01.7622678Z if compiled: 2025-05-07T20:32:01.7622929Z op = torch.compile(op) 2025-05-07T20:32:01.7623237Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7623528Z 2025-05-07T20:32:01.7623725Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.7623905Z 2025-05-07T20:32:01.7624013Z moe/activation_test.py:117: 2025-05-07T20:32:01.7624320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7624671Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.7624963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.7625664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.7626370Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.7626913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.7627606Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.7628283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.7628827Z kernel = self.compile( 2025-05-07T20:32:01.7629373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.7630044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.7630453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.7630685Z 2025-05-07T20:32:01.7630911Z self = 2025-05-07T20:32:01.7632043Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.7633421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bda39510>} 2025-05-07T20:32:01.7634775Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.7635813Z context = 2025-05-07T20:32:01.7636111Z 2025-05-07T20:32:01.7636284Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.7636815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.7637297Z module_map=module_map) 2025-05-07T20:32:01.7637675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.7638039Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.7638311Z E ^ 2025-05-07T20:32:01.7638786Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.7639236Z 2025-05-07T20:32:01.7639663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.7640273Z 2025-05-07T20:32:01.7640381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.7640874Z self=, 2025-05-07T20:32:01.7641290Z T=1, 2025-05-07T20:32:01.7641477Z D=5120, 2025-05-07T20:32:01.7641684Z scale_ub=None, 2025-05-07T20:32:01.7641912Z contiguous=True, 2025-05-07T20:32:01.7642137Z compiled=True, 2025-05-07T20:32:01.7642347Z ) 2025-05-07T20:32:02.2212270Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:02.2213351Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:02.2214694Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:02.2216152Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:02.2217520Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:02.2218888Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.2220289Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:02.2221666Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.2223063Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:02.2224297Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:02.2225506Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:02.2226719Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:02.2227754Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:02.2228760Z W0507 20:32:02.217000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:02.2229960Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:02.2231268Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:02.2232378Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:02.2233906Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:02.2235078Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:02.2236421Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:02.2237474Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.2238369Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.2239113Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:02.2240126Z W0507 20:32:02.217000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.3829515Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:02.3831004Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:02.3832471Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:02.3833948Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:02.3835317Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:02.3836700Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.3837993Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:02.3839356Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.3840755Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:02.3841988Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:02.3843208Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:02.3844404Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:02.3845912Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:02.3846925Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:02.3848132Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:02.3849400Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:02.3850503Z W0507 
20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:02.3851556Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:02.3852717Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:02.3854064Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:02.3855114Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.3856015Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.3856743Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:02.3857769Z W0507 20:32:02.379000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.8226778Z self = 2025-05-07T20:32:02.8227437Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:02.8227697Z 2025-05-07T20:32:02.8227786Z @given( 2025-05-07T20:32:02.8228018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.8228338Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.8228652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.8228990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.8229316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.8229613Z ) 2025-05-07T20:32:02.8230002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.8230452Z def test_silu_mul_quant( 2025-05-07T20:32:02.8230698Z self, 2025-05-07T20:32:02.8230900Z T: int, 2025-05-07T20:32:02.8231096Z D: int, 2025-05-07T20:32:02.8231348Z scale_ub: Optional[float], 2025-05-07T20:32:02.8231647Z contiguous: bool, 2025-05-07T20:32:02.8231885Z compiled: bool, 2025-05-07T20:32:02.8232121Z ) -> None: 2025-05-07T20:32:02.8232342Z torch.manual_seed(2025) 2025-05-07T20:32:02.8232580Z 2025-05-07T20:32:02.8232862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.8233206Z 2025-05-07T20:32:02.8233409Z x_sign = torch.sign(x) 2025-05-07T20:32:02.8233702Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.8234022Z x = x_sign * x_clamp 2025-05-07T20:32:02.8234276Z x0 = x[:, :D] 2025-05-07T20:32:02.8234903Z x1 = x[:, D:] 2025-05-07T20:32:02.8235117Z 2025-05-07T20:32:02.8235309Z if contiguous: 2025-05-07T20:32:02.8235680Z x0 = x0.contiguous() 2025-05-07T20:32:02.8235949Z x1 = x1.contiguous() 2025-05-07T20:32:02.8236193Z 2025-05-07T20:32:02.8236389Z if scale_ub is not None: 2025-05-07T20:32:02.8236673Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.8237014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.8237320Z ) 2025-05-07T20:32:02.8237520Z else: 2025-05-07T20:32:02.8237743Z scale_ub_tensor = None 
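
The CompilationError repeated above has a single root cause: Triton's fp8e4nv dtype corresponds to torch.float8_e4m3fn, which only lowers to native FP8 on GPUs of compute capability 8.9 or newer (Ada/Hopper), while the A10G on this g5.4xlarge runner is sm_86 and therefore exposes only fp8e4b15 and fp8e5, exactly as the ValueError states. A minimal sketch of a capability guard such a test could use to skip cleanly on pre-sm_89 hardware; the helper name and the (8, 9) threshold are illustrative assumptions, not part of the FBGEMM test suite:

    import unittest
    import torch

    def _device_supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= 8.9;
        # the A10G on this runner reports sm_86, so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage: decorate fp8 tests so they skip on older GPUs
    # instead of failing at Triton compile time.
    requires_fp8 = unittest.skipUnless(
        _device_supports_fp8e4nv(), "fp8e4nv unsupported on this architecture"
    )
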
2025-05-07T20:32:02.8237990Z 2025-05-07T20:32:02.8238233Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.8238551Z op = silu_mul_quant 2025-05-07T20:32:02.8238819Z if compiled: 2025-05-07T20:32:02.8239073Z op = torch.compile(op) 2025-05-07T20:32:02.8239375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.8239666Z 2025-05-07T20:32:02.8239861Z y_fp8, y_scale = fn() 2025-05-07T20:32:02.8240163Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:02.8240461Z 2025-05-07T20:32:02.8240699Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.8241045Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:02.8241347Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:02.8241662Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:02.8242029Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:02.8242348Z 2025-05-07T20:32:02.8242549Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:02.8242754Z 2025-05-07T20:32:02.8242858Z moe/activation_test.py:126: 2025-05-07T20:32:02.8243162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.8243498Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:02.8243827Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:02.8244623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:02.8245375Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:02.8245932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.8246611Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.8247300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:02.8248024Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:02.8248777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:02.8249523Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:02.8250261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:02.8250905Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:02.8251557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:02.8252078Z fn() 2025-05-07T20:32:02.8252598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:02.8253178Z self.fn.run( 2025-05-07T20:32:02.8253641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.8254176Z kernel = self.compile( 2025-05-07T20:32:02.8254716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.8255465Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.8255939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.8256174Z 2025-05-07T20:32:02.8256384Z self = 2025-05-07T20:32:02.8257469Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.8258864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bdf070a0>} 2025-05-07T20:32:02.8260315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.8261374Z context = 2025-05-07T20:32:02.8261693Z 2025-05-07T20:32:02.8261868Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.8262410Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.8269776Z module_map=module_map) 2025-05-07T20:32:02.8270185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.8270562Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:02.8270844Z E ^ 2025-05-07T20:32:02.8271330Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.8271789Z 2025-05-07T20:32:02.8272218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.8272746Z 2025-05-07T20:32:02.8272864Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.8273288Z self=, 2025-05-07T20:32:02.8273700Z T=2048, 2025-05-07T20:32:02.8273908Z D=5120, 2025-05-07T20:32:02.8274111Z scale_ub=None, 2025-05-07T20:32:02.8274341Z contiguous=True, 2025-05-07T20:32:02.8274577Z compiled=True, 2025-05-07T20:32:02.8274793Z ) 2025-05-07T20:32:03.2400274Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.2401381Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:03.2402737Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.2404203Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.2405580Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.2406961Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.2408254Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.2410117Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.2411528Z W0507 20:32:03.236000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.2412751Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.2413966Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.2415163Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:03.2416205Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:03.2417217Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:03.2418416Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.2419691Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.2420886Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:03.2421931Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:03.2423093Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.2424437Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.2425489Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.2426402Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.2427146Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:03.2428156Z W0507 20:32:03.236000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.4004640Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.4006021Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:03.4007363Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.4009320Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.4010695Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.4012061Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.4013379Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.4015692Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.4017116Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.4018356Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.4019558Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.4020836Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:03.4021888Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:03.4022912Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:03.4024122Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.4025394Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.4026504Z W0507 
20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:03.4027547Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:03.4028720Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.4030083Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.4031133Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.4032043Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.4032786Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:03.4033970Z W0507 20:32:03.397000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8386108Z self = 2025-05-07T20:32:03.8386675Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:03.8386954Z 2025-05-07T20:32:03.8387039Z @given( 2025-05-07T20:32:03.8387288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.8387613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.8387924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.8388266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.8388607Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.8388902Z ) 2025-05-07T20:32:03.8389296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.8389762Z def test_silu_mul_quant( 2025-05-07T20:32:03.8390263Z self, 2025-05-07T20:32:03.8390478Z T: int, 2025-05-07T20:32:03.8390689Z D: int, 2025-05-07T20:32:03.8390923Z scale_ub: Optional[float], 2025-05-07T20:32:03.8391199Z contiguous: bool, 2025-05-07T20:32:03.8391459Z compiled: bool, 2025-05-07T20:32:03.8391702Z ) -> None: 2025-05-07T20:32:03.8391922Z torch.manual_seed(2025) 2025-05-07T20:32:03.8392180Z 2025-05-07T20:32:03.8392466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.8392813Z 2025-05-07T20:32:03.8393019Z x_sign = torch.sign(x) 2025-05-07T20:32:03.8393320Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.8393630Z x = x_sign * x_clamp 2025-05-07T20:32:03.8393883Z x0 = x[:, :D] 2025-05-07T20:32:03.8394120Z x1 = x[:, D:] 2025-05-07T20:32:03.8394331Z 2025-05-07T20:32:03.8394527Z if contiguous: 2025-05-07T20:32:03.8394776Z x0 = x0.contiguous() 2025-05-07T20:32:03.8395040Z x1 = x1.contiguous() 2025-05-07T20:32:03.8395291Z 2025-05-07T20:32:03.8395498Z if scale_ub is not None: 2025-05-07T20:32:03.8395776Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.8396123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.8396438Z ) 2025-05-07T20:32:03.8396642Z else: 2025-05-07T20:32:03.8396860Z scale_ub_tensor = None 
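
For context on the failing reference path: ref_fn computes y = x0 * sigmoid(x0) * x1 in fp32 and hands it to triton_quantize_fp8_row, whose kernel then fails to compile for the same reason as the main op. A rough pure-PyTorch sketch of what rowwise FP8 quantization computes, assuming a per-row max scale optionally clamped by scale_ub; the eps value and exact clamping order inside _kernel_quantize_fp8_row are assumptions here, not its actual implementation:

    import torch

    def rowwise_fp8_quant_sketch(y, scale_ub=None, eps=1e-12):
        # Per-row absolute maximum determines each row's dynamic range.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = torch.clamp(row_max, min=eps) / fp8_max  # dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize with y_fp8.to(torch.float32) * scale[:, None], which is
        # exactly what the test does right after fn() returns.
        return y_fp8, scale
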
2025-05-07T20:32:03.8397119Z 2025-05-07T20:32:03.8397359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8397675Z op = silu_mul_quant 2025-05-07T20:32:03.8397936Z if compiled: 2025-05-07T20:32:03.8398195Z op = torch.compile(op) 2025-05-07T20:32:03.8398494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.8398781Z 2025-05-07T20:32:03.8398989Z y_fp8, y_scale = fn() 2025-05-07T20:32:03.8399283Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:03.8399582Z 2025-05-07T20:32:03.8399830Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8400168Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:03.8400477Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:03.8400803Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:03.8401171Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.8401492Z 2025-05-07T20:32:03.8401743Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:03.8401947Z 2025-05-07T20:32:03.8402060Z moe/activation_test.py:126: 2025-05-07T20:32:03.8402368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8402720Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:03.8403436Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.8404384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:03.8405161Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:03.8405718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.8406416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.8407107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:03.8407835Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.8408596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:03.8409357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.8410098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:03.8410744Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:03.8411352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:03.8411877Z fn() 2025-05-07T20:32:03.8412390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:03.8412975Z self.fn.run( 2025-05-07T20:32:03.8413454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.8413983Z kernel = self.compile( 2025-05-07T20:32:03.8414530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.8415196Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.8415603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8415834Z 2025-05-07T20:32:03.8416050Z self = 2025-05-07T20:32:03.8417129Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.8418533Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bd48a7a0>} 2025-05-07T20:32:03.8419960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.8420984Z context = 2025-05-07T20:32:03.8421284Z 2025-05-07T20:32:03.8421456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.8421984Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.8422455Z module_map=module_map) 2025-05-07T20:32:03.8422824Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.8423187Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:03.8424980Z E ^ 2025-05-07T20:32:03.8425446Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8425901Z 2025-05-07T20:32:03.8426316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.8426934Z 2025-05-07T20:32:03.8427043Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.8427534Z self=, 2025-05-07T20:32:03.8427937Z T=128, 2025-05-07T20:32:03.8428139Z D=5120, 2025-05-07T20:32:03.8428347Z scale_ub=None, 2025-05-07T20:32:03.8428567Z contiguous=True, 2025-05-07T20:32:03.8428805Z compiled=True, 2025-05-07T20:32:03.8429025Z ) 2025-05-07T20:32:04.3109045Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.3110145Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:04.3111478Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.3112952Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.3114316Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.3115688Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.3116988Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.3118358Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.3119770Z W0507 20:32:04.307000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.3121003Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:04.3122256Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.3123464Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:04.3124500Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.3125509Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:04.3126813Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.3128097Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.3129212Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.3130728Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:04.3131906Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.3133270Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.3134331Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.3135244Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.3135996Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:04.3137020Z W0507 20:32:04.307000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.4740773Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.4742283Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:04.4743606Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.4745059Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.4746427Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.4747790Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.4749431Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.4751164Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.4752958Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.4754521Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:04.4756053Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.4757558Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:04.4759025Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.4760047Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:04.4761258Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.4762518Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.4763625Z W0507 
20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.4764669Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:04.4765838Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.4767193Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.4768240Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.4769146Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.4769884Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:04.4770904Z W0507 20:32:04.470000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2250208Z self = 2025-05-07T20:32:05.2250786Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.2251056Z 2025-05-07T20:32:05.2251137Z @given( 2025-05-07T20:32:05.2251372Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.2251678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.2251992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.2252328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.2252653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.2252945Z ) 2025-05-07T20:32:05.2253327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.2253774Z def test_silu_mul_quant( 2025-05-07T20:32:05.2254021Z self, 2025-05-07T20:32:05.2254218Z T: int, 2025-05-07T20:32:05.2254421Z D: int, 2025-05-07T20:32:05.2254639Z scale_ub: Optional[float], 2025-05-07T20:32:05.2254922Z contiguous: bool, 2025-05-07T20:32:05.2255165Z compiled: bool, 2025-05-07T20:32:05.2255393Z ) -> None: 2025-05-07T20:32:05.2255613Z torch.manual_seed(2025) 2025-05-07T20:32:05.2255855Z 2025-05-07T20:32:05.2256127Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.2256465Z 2025-05-07T20:32:05.2256665Z x_sign = torch.sign(x) 2025-05-07T20:32:05.2256980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.2257285Z x = x_sign * x_clamp 2025-05-07T20:32:05.2257531Z x0 = x[:, :D] 2025-05-07T20:32:05.2258190Z x1 = x[:, D:] 2025-05-07T20:32:05.2258400Z 2025-05-07T20:32:05.2258599Z if contiguous: 2025-05-07T20:32:05.2258988Z x0 = x0.contiguous() 2025-05-07T20:32:05.2259248Z x1 = x1.contiguous() 2025-05-07T20:32:05.2259494Z 2025-05-07T20:32:05.2259693Z if scale_ub is not None: 2025-05-07T20:32:05.2260083Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.2260423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.2260734Z ) 2025-05-07T20:32:05.2260934Z else: 2025-05-07T20:32:05.2261148Z scale_ub_tensor = None 
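
The surrounding W0507 identify_mutated_tensors warnings are a side effect of torch.compile: before wrapping a user-defined Triton kernel, Dynamo generates TTIR for it to determine which arguments are mutated, and when that generation raises, as it does on every attempt here, it logs the traceback and conservatively assumes every input is mutated. The architecture limit reproduces without the test harness; a standalone sketch, with the kernel body and launch parameters invented for illustration rather than taken from FBGEMM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8_demo(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 this cast is what ultimately raises:
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, dtype=torch.float8_e4m3fn, device="cuda")
    _cast_fp8_demo[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)
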
2025-05-07T20:32:05.2261403Z 2025-05-07T20:32:05.2261642Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.2261957Z op = silu_mul_quant 2025-05-07T20:32:05.2262217Z if compiled: 2025-05-07T20:32:05.2262471Z op = torch.compile(op) 2025-05-07T20:32:05.2262770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.2263056Z 2025-05-07T20:32:05.2263253Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.2263546Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.2263839Z 2025-05-07T20:32:05.2264087Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.2264416Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.2264717Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.2265042Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.2265401Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.2265708Z 2025-05-07T20:32:05.2265911Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.2266105Z 2025-05-07T20:32:05.2266216Z moe/activation_test.py:126: 2025-05-07T20:32:05.2266510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.2266844Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.2267181Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.2267973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.2268722Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.2269274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.2269955Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.2270639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.2271360Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.2272163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.2272907Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.2273637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.2274284Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.2274882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.2275402Z fn() 2025-05-07T20:32:05.2275911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.2276488Z self.fn.run( 2025-05-07T20:32:05.2276957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.2277485Z kernel = self.compile( 2025-05-07T20:32:05.2278025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.2278792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2279262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.2279494Z 2025-05-07T20:32:05.2279704Z self = 2025-05-07T20:32:05.2280786Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.2282179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bd48bc70>} 2025-05-07T20:32:05.2283508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.2284530Z context = 2025-05-07T20:32:05.2284823Z 2025-05-07T20:32:05.2284992Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.2285520Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2285982Z module_map=module_map) 2025-05-07T20:32:05.2286347Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2286705Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.2286975Z E ^ 2025-05-07T20:32:05.2287436Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2287888Z 2025-05-07T20:32:05.2288310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.2288836Z 2025-05-07T20:32:05.2288942Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.2289359Z self=, 2025-05-07T20:32:05.2289751Z T=4096, 2025-05-07T20:32:05.2290261Z D=5120, 2025-05-07T20:32:05.2290459Z scale_ub=None, 2025-05-07T20:32:05.2290670Z contiguous=True, 2025-05-07T20:32:05.2290892Z compiled=True, 2025-05-07T20:32:05.2291104Z ) 2025-05-07T20:32:05.6997040Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.6998353Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:05.6999691Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.7001170Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.7002551Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.7003934Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.7005235Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.7007076Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.7008499Z W0507 20:32:05.696000 87525 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.7009748Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:05.7010952Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.7012151Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:05.7013197Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:05.7014220Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:05.7015420Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.7016687Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.7017793Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:05.7018833Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:05.7020070Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.7021405Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.7022458Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.7023360Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.7024097Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:05.7025109Z W0507 20:32:05.696000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.8621313Z W0507 20:32:05.858000 87525 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[The identify_mutated_tensors traceback at 20:32:05.858 repeats the one above verbatim, ending in the same CompilationError at the definition of _fbgemm_silu_mul_quant with the same fp8e4nv ValueError; it is elided here.]
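The ValueError above is the root cause for every failure in this job: 'fp8e4nv' is Triton's name for torch.float8_e4m3fn, which Triton's NVIDIA backend compiles only for compute capability 8.9 and newer (Ada/Hopper), while the linux.g5.4xlarge runner carries an A10G that reports (8, 6). A minimal guard, as a sketch (the helper name supports_fp8e4nv and the (8, 9) cutoff are our annotation, inferred from the error message, not taken from FBGEMM):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) is accepted by Triton's CUDA backend only
        # on sm_89-class GPUs or newer; the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)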
2025-05-07T20:32:06.4447407Z self = <...>
2025-05-07T20:32:06.4447955Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:06.4448242Z
2025-05-07T20:32:06.4448330Z     @given(
2025-05-07T20:32:06.4448570Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:06.4448891Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:06.4449205Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:06.4449536Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:06.4449878Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:06.4450171Z     )
2025-05-07T20:32:06.4450555Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:06.4451016Z     def test_silu_mul_quant(
2025-05-07T20:32:06.4451273Z         self,
2025-05-07T20:32:06.4451480Z         T: int,
2025-05-07T20:32:06.4451682Z         D: int,
2025-05-07T20:32:06.4451915Z         scale_ub: Optional[float],
2025-05-07T20:32:06.4452198Z         contiguous: bool,
2025-05-07T20:32:06.4452442Z         compiled: bool,
2025-05-07T20:32:06.4452680Z     ) -> None:
2025-05-07T20:32:06.4452907Z         torch.manual_seed(2025)
2025-05-07T20:32:06.4453155Z
2025-05-07T20:32:06.4453438Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:06.4453790Z
2025-05-07T20:32:06.4453988Z         x_sign = torch.sign(x)
2025-05-07T20:32:06.4454293Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:06.4454612Z         x = x_sign * x_clamp
2025-05-07T20:32:06.4454856Z         x0 = x[:, :D]
2025-05-07T20:32:06.4455088Z         x1 = x[:, D:]
2025-05-07T20:32:06.4455306Z
2025-05-07T20:32:06.4455495Z         if contiguous:
2025-05-07T20:32:06.4455746Z             x0 = x0.contiguous()
2025-05-07T20:32:06.4456020Z             x1 = x1.contiguous()
2025-05-07T20:32:06.4456258Z
2025-05-07T20:32:06.4456457Z         if scale_ub is not None:
2025-05-07T20:32:06.4456737Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:06.4457078Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:06.4457391Z             )
2025-05-07T20:32:06.4457594Z         else:
2025-05-07T20:32:06.4457817Z             scale_ub_tensor = None
2025-05-07T20:32:06.4458072Z
2025-05-07T20:32:06.4458311Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.4458631Z             op = silu_mul_quant
2025-05-07T20:32:06.4458881Z             if compiled:
2025-05-07T20:32:06.4459137Z                 op = torch.compile(op)
2025-05-07T20:32:06.4459439Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.4459717Z
2025-05-07T20:32:06.4460000Z         y_fp8, y_scale = fn()
2025-05-07T20:32:06.4460299Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:06.4460588Z
2025-05-07T20:32:06.4460828Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:06.4461163Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:06.4461462Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:06.4461775Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:06.4462140Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:06.4462459Z
2025-05-07T20:32:06.4462661Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:06.4462863Z
2025-05-07T20:32:06.4462965Z moe/activation_test.py:126:
2025-05-07T20:32:06.4463267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:06.4463599Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:06.4464325Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:06.4465260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:06.4466019Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:06.4466572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:06.4467254Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:06.4467939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:06.4468651Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:06.4469402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:06.4470151Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:06.4470892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:06.4471522Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:06.4472120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:06.4472637Z     fn()
2025-05-07T20:32:06.4473139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:06.4473714Z     self.fn.run(
2025-05-07T20:32:06.4474181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:06.4474716Z     kernel = self.compile(
2025-05-07T20:32:06.4475252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:06.4475907Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.4476311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:06.4476540Z
2025-05-07T20:32:06.4476754Z self = <...>
2025-05-07T20:32:06.4477821Z options =
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:06.4479200Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f07bcfa4940>}
2025-05-07T20:32:06.4480533Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:06.4481560Z context = <...>
2025-05-07T20:32:06.4481851Z
2025-05-07T20:32:06.4482017Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:06.4482543Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.4483010Z                            module_map=module_map)
2025-05-07T20:32:06.4483379Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.4483730Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:06.4484003Z E       ^
2025-05-07T20:32:06.4484469Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:06.4484916Z
2025-05-07T20:32:06.4485338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:06.4485953Z
2025-05-07T20:32:06.4486061Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:06.4486556Z     self=<...>,
2025-05-07T20:32:06.4486969Z     T=16384,
2025-05-07T20:32:06.4487163Z     D=5120,
2025-05-07T20:32:06.4487365Z     scale_ub=None,
2025-05-07T20:32:06.4487586Z     contiguous=True,
2025-05-07T20:32:06.4487807Z     compiled=True,
2025-05-07T20:32:06.4488020Z )
2025-05-07T20:32:06.4879181Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:06.4880429Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:06.4881752Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:06.4882799Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:06.4883901Z W0507 20:32:06.486000 87525 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
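The recompile_limit warning above is separate from the compilation failures: every Hypothesis example changes T, and contiguous=False additionally changes the strides of the x0/x1 views, so torch.compile re-specializes silu_mul_quant per guard set until it hits the limit of 8 and falls back to eager. One mitigation, as a sketch assuming the module path shown in the log (torch._dynamo.mark_dynamic is real API; applying it here is our suggestion and may not remove stride-based guards):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    # Mark the token dimension dynamic so one dynamic-shape graph can serve all
    # T values Hypothesis samples, instead of one specialized graph per shape.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = op(x0, x1, None)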
2025-05-07T20:32:06.5912999Z self = <...>
2025-05-07T20:32:06.5913524Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[The test body, the traceback through ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, and the CompilationError are identical to the T=4096 failure above; elided.]
2025-05-07T20:32:06.5951559Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:06.5951969Z     self=<...>,
2025-05-07T20:32:06.5952370Z     T=1,
2025-05-07T20:32:06.5952561Z     D=5120,
2025-05-07T20:32:06.5952753Z     scale_ub=1200.0,
2025-05-07T20:32:06.5952979Z     contiguous=True,
2025-05-07T20:32:06.5953205Z     compiled=True,
2025-05-07T20:32:06.5953407Z )
2025-05-07T20:32:06.7403716Z self = <...>
2025-05-07T20:32:06.7404258Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[Same test body as above; this example fails one step earlier, inside fn() itself, at the compiled silu_mul_quant:]
2025-05-07T20:32:06.7416686Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:06.7416860Z
2025-05-07T20:32:06.7416962Z moe/activation_test.py:117:
2025-05-07T20:32:06.7417280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:06.7417619Z moe/activation_test.py:115: in fn
2025-05-07T20:32:06.7417900Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:06.7418463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:06.7419028Z     return fn(*args, **kwargs)
2025-05-07T20:32:06.7419685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:06.7420470Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:06.7421014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:06.7421699Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:06.7422359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:06.7422892Z     kernel = self.compile(
2025-05-07T20:32:06.7423435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:06.7424097Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.7424490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:06.7424728Z
2025-05-07T20:32:06.7424936Z self = <...>
2025-05-07T20:32:06.7426021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:06.7427410Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f07bcfa68c0>}
2025-05-07T20:32:06.7428739Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:06.7429775Z context = <...>
2025-05-07T20:32:06.7430067Z
2025-05-07T20:32:06.7430239Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:06.7430761Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.7431235Z                            module_map=module_map)
2025-05-07T20:32:06.7431609Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.7431967Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:06.7432223Z E       ^
2025-05-07T20:32:06.7432693Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
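Both failing kernels die at the same point because both cast their output to fp8e4nv. The name correspondence and a fallback policy, as an illustrative sketch (the mapping follows the error message and the standard torch/Triton correspondence; the fallback is our illustration, not FBGEMM behavior):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # torch.float8_e4m3fn <-> triton 'fp8e4nv' (rejected here: needs sm_89+)
        # torch.float8_e5m2   <-> triton 'fp8e5'   (listed as supported in the error)
        if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2  # wider exponent, less mantissa: pre-Ada fallback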
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.7433144Z 2025-05-07T20:32:06.7433567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:06.7434079Z 2025-05-07T20:32:06.7434191Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:06.7434598Z self=, 2025-05-07T20:32:06.7435000Z T=1, 2025-05-07T20:32:06.7435187Z D=5120, 2025-05-07T20:32:06.7435379Z scale_ub=None, 2025-05-07T20:32:06.7435599Z contiguous=False, 2025-05-07T20:32:06.7435935Z compiled=True, 2025-05-07T20:32:06.7436144Z ) 2025-05-07T20:32:06.8114192Z self = 2025-05-07T20:32:06.8114719Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:06.8114989Z 2025-05-07T20:32:06.8115065Z @given( 2025-05-07T20:32:06.8115303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:06.8115616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:06.8115928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:06.8116261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:06.8116587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:06.8116876Z ) 2025-05-07T20:32:06.8117231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:06.8117684Z def test_silu_mul_quant( 2025-05-07T20:32:06.8117926Z self, 2025-05-07T20:32:06.8118128Z T: int, 2025-05-07T20:32:06.8118339Z D: int, 2025-05-07T20:32:06.8118558Z scale_ub: Optional[float], 2025-05-07T20:32:06.8118841Z contiguous: bool, 2025-05-07T20:32:06.8119088Z compiled: bool, 2025-05-07T20:32:06.8119311Z ) -> None: 2025-05-07T20:32:06.8119531Z torch.manual_seed(2025) 2025-05-07T20:32:06.8119772Z 2025-05-07T20:32:06.8120040Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:06.8120381Z 2025-05-07T20:32:06.8120578Z x_sign = torch.sign(x) 2025-05-07T20:32:06.8120868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:06.8121181Z x = x_sign * x_clamp 2025-05-07T20:32:06.8121426Z x0 = x[:, :D] 2025-05-07T20:32:06.8121640Z x1 = x[:, D:] 2025-05-07T20:32:06.8121853Z 2025-05-07T20:32:06.8122043Z if contiguous: 2025-05-07T20:32:06.8122277Z x0 = x0.contiguous() 2025-05-07T20:32:06.8122539Z x1 = x1.contiguous() 2025-05-07T20:32:06.8122790Z 2025-05-07T20:32:06.8122987Z if scale_ub is not None: 2025-05-07T20:32:06.8123257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:06.8123601Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:06.8123912Z ) 2025-05-07T20:32:06.8124103Z else: 2025-05-07T20:32:06.8124320Z scale_ub_tensor = None 2025-05-07T20:32:06.8124574Z 2025-05-07T20:32:06.8124802Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:06.8125118Z op = silu_mul_quant 2025-05-07T20:32:06.8125375Z if compiled: 2025-05-07T20:32:06.8125623Z op = torch.compile(op) 2025-05-07T20:32:06.8125924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:06.8126203Z 2025-05-07T20:32:06.8126394Z y_fp8, y_scale = fn() 2025-05-07T20:32:06.8126681Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:06.8126980Z 2025-05-07T20:32:06.8127219Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:06.8127553Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:06.8127858Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:06.8128178Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:06.8128537Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:06.8128850Z 2025-05-07T20:32:06.8129055Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:06.8129251Z 2025-05-07T20:32:06.8129352Z moe/activation_test.py:126: 2025-05-07T20:32:06.8129654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:06.8129988Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:06.8130319Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:06.8131105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:06.8132066Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:06.8132700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:06.8133379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:06.8134070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:06.8134793Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:06.8135558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:06.8136298Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:06.8137027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:06.8137684Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:06.8138301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:06.8138814Z fn() 2025-05-07T20:32:06.8139330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:06.8140015Z self.fn.run( 2025-05-07T20:32:06.8140506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:06.8141041Z kernel = self.compile( 2025-05-07T20:32:06.8149580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:06.8150267Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.8150676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:06.8150909Z 2025-05-07T20:32:06.8151129Z self = 2025-05-07T20:32:06.8152224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:06.8153648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bc8a7880>} 2025-05-07T20:32:06.8155000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:06.8156043Z context = 2025-05-07T20:32:06.8156332Z 2025-05-07T20:32:06.8156503Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:06.8157039Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.8157506Z module_map=module_map) 2025-05-07T20:32:06.8157873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.8158241Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:06.8158516Z E ^ 2025-05-07T20:32:06.8158990Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.8159438Z 2025-05-07T20:32:06.8159867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:06.8160395Z 2025-05-07T20:32:06.8160503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:06.8160927Z self=, 2025-05-07T20:32:06.8161333Z T=1, 2025-05-07T20:32:06.8161519Z D=5120, 2025-05-07T20:32:06.8161859Z scale_ub=None, 2025-05-07T20:32:06.8162087Z contiguous=True, 2025-05-07T20:32:06.8162312Z compiled=False, 2025-05-07T20:32:06.8162664Z ) 2025-05-07T20:32:07.1606876Z self = 2025-05-07T20:32:07.1607422Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:07.1607685Z 2025-05-07T20:32:07.1607775Z @given( 2025-05-07T20:32:07.1608009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.1608333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.1608649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.1608989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.1609324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.1609613Z ) 2025-05-07T20:32:07.1609969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.1610436Z def test_silu_mul_quant( 2025-05-07T20:32:07.1610682Z self, 2025-05-07T20:32:07.1610884Z T: int, 2025-05-07T20:32:07.1611080Z D: int, 2025-05-07T20:32:07.1611317Z scale_ub: Optional[float], 2025-05-07T20:32:07.1611595Z contiguous: bool, 2025-05-07T20:32:07.1611835Z compiled: bool, 2025-05-07T20:32:07.1612070Z ) -> None: 2025-05-07T20:32:07.1612297Z torch.manual_seed(2025) 2025-05-07T20:32:07.1612536Z 2025-05-07T20:32:07.1612816Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.1613166Z 2025-05-07T20:32:07.1613360Z x_sign = torch.sign(x) 2025-05-07T20:32:07.1613662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.1613976Z x = x_sign * x_clamp 2025-05-07T20:32:07.1614227Z x0 = x[:, :D] 2025-05-07T20:32:07.1614443Z x1 = x[:, D:] 2025-05-07T20:32:07.1614657Z 2025-05-07T20:32:07.1614856Z if contiguous: 2025-05-07T20:32:07.1615097Z x0 = x0.contiguous() 2025-05-07T20:32:07.1615365Z x1 = x1.contiguous() 2025-05-07T20:32:07.1615605Z 2025-05-07T20:32:07.1615800Z if scale_ub is not None: 2025-05-07T20:32:07.1616081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.1616425Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.1616729Z ) 2025-05-07T20:32:07.1616929Z else: 2025-05-07T20:32:07.1617152Z scale_ub_tensor = None 2025-05-07T20:32:07.1617415Z 2025-05-07T20:32:07.1617650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.1617959Z op = silu_mul_quant 2025-05-07T20:32:07.1618218Z if compiled: 2025-05-07T20:32:07.1618470Z 
op = torch.compile(op) 2025-05-07T20:32:07.1618766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.1619035Z 2025-05-07T20:32:07.1619233Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.1619397Z 2025-05-07T20:32:07.1619506Z moe/activation_test.py:117: 2025-05-07T20:32:07.1619898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.1620236Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.1620525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.1621215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.1621911Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.1622450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.1623132Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.1623787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.1624325Z kernel = self.compile( 2025-05-07T20:32:07.1624868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.1626031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.1626432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.1626664Z 2025-05-07T20:32:07.1626872Z self = 2025-05-07T20:32:07.1627953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.1629323Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc8a6b00>} 2025-05-07T20:32:07.1630646Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.1631690Z context = 2025-05-07T20:32:07.1631979Z 2025-05-07T20:32:07.1632151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.1632672Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.1633143Z module_map=module_map) 2025-05-07T20:32:07.1633514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.1633866Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.1634128Z E ^ 2025-05-07T20:32:07.1634591Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.1636293Z 2025-05-07T20:32:07.1636715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.1637231Z 2025-05-07T20:32:07.1637342Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.1637756Z self=, 2025-05-07T20:32:07.1638154Z T=128, 2025-05-07T20:32:07.1638344Z D=5120, 2025-05-07T20:32:07.1638532Z scale_ub=None, 2025-05-07T20:32:07.1638752Z contiguous=False, 2025-05-07T20:32:07.1638981Z compiled=True, 2025-05-07T20:32:07.1639185Z ) 2025-05-07T20:32:07.1639509Z self = 2025-05-07T20:32:07.1640018Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:07.1640294Z 2025-05-07T20:32:07.1640375Z @given( 2025-05-07T20:32:07.1640612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.1640918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.1641226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.1641562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.1641887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.1642172Z ) 2025-05-07T20:32:07.1642525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.1642961Z def test_silu_mul_quant( 2025-05-07T20:32:07.1643201Z self, 2025-05-07T20:32:07.1643398Z T: int, 2025-05-07T20:32:07.1643593Z D: int, 2025-05-07T20:32:07.1643814Z scale_ub: Optional[float], 2025-05-07T20:32:07.1644090Z contiguous: bool, 2025-05-07T20:32:07.1644331Z compiled: bool, 2025-05-07T20:32:07.1644551Z ) -> None: 2025-05-07T20:32:07.1644772Z torch.manual_seed(2025) 2025-05-07T20:32:07.1645014Z 2025-05-07T20:32:07.1645283Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.1645624Z 2025-05-07T20:32:07.1645820Z x_sign = torch.sign(x) 2025-05-07T20:32:07.1646201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.1646511Z x = x_sign * x_clamp 2025-05-07T20:32:07.1646830Z x0 = x[:, :D] 2025-05-07T20:32:07.1647047Z x1 = x[:, D:] 2025-05-07T20:32:07.1647256Z 2025-05-07T20:32:07.1647442Z if contiguous: 2025-05-07T20:32:07.1647668Z x0 = x0.contiguous() 2025-05-07T20:32:07.1647927Z x1 = x1.contiguous() 2025-05-07T20:32:07.1648171Z 2025-05-07T20:32:07.1648360Z if scale_ub is not None: 2025-05-07T20:32:07.1648632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.1648966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.1649280Z ) 2025-05-07T20:32:07.1649467Z else: 2025-05-07T20:32:07.1649681Z scale_ub_tensor = None 2025-05-07T20:32:07.1649934Z 2025-05-07T20:32:07.1650162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.1650477Z op = silu_mul_quant 2025-05-07T20:32:07.1650741Z if compiled: 2025-05-07T20:32:07.1650993Z op = torch.compile(op) 2025-05-07T20:32:07.1651301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.1651576Z 2025-05-07T20:32:07.1651768Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.1651941Z 2025-05-07T20:32:07.1652044Z moe/activation_test.py:117: 2025-05-07T20:32:07.1652344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.1652672Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.1652960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.1653518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:07.1654087Z return fn(*args, **kwargs) 
2025-05-07T20:32:07.1654746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.1655442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.1655985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.1656660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.1657320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.1657850Z kernel = self.compile( 2025-05-07T20:32:07.1658391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.1659042Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.1659436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.1659662Z 2025-05-07T20:32:07.1659970Z self = 2025-05-07T20:32:07.1661043Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.1662419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc56b370>} 2025-05-07T20:32:07.1663752Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.1664767Z context = 2025-05-07T20:32:07.1665055Z 2025-05-07T20:32:07.1665230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.1665745Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.1666306Z module_map=module_map) 2025-05-07T20:32:07.1666669Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.1667132Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.1667394Z E ^ 2025-05-07T20:32:07.1667860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.1668305Z 2025-05-07T20:32:07.1668723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.1669228Z 2025-05-07T20:32:07.1669338Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.1669746Z self=, 2025-05-07T20:32:07.1670146Z T=128, 2025-05-07T20:32:07.1670340Z D=7168, 2025-05-07T20:32:07.1670528Z scale_ub=1200.0, 2025-05-07T20:32:07.1670756Z contiguous=False, 2025-05-07T20:32:07.1670988Z compiled=False, 2025-05-07T20:32:07.1671201Z ) 2025-05-07T20:32:07.2942602Z self = 2025-05-07T20:32:07.2943200Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:07.2943475Z 2025-05-07T20:32:07.2943560Z @given( 2025-05-07T20:32:07.2943790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.2944104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.2944413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.2944740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.2945075Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.2945363Z ) 2025-05-07T20:32:07.2945705Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.2946158Z def test_silu_mul_quant( 2025-05-07T20:32:07.2946401Z self, 2025-05-07T20:32:07.2946601Z T: int, 2025-05-07T20:32:07.2946805Z D: int, 2025-05-07T20:32:07.2947030Z scale_ub: Optional[float], 2025-05-07T20:32:07.2947301Z contiguous: bool, 2025-05-07T20:32:07.2947546Z compiled: bool, 2025-05-07T20:32:07.2947773Z ) -> None: 2025-05-07T20:32:07.2947992Z torch.manual_seed(2025) 2025-05-07T20:32:07.2948231Z 2025-05-07T20:32:07.2948504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.2948843Z 2025-05-07T20:32:07.2949034Z x_sign = torch.sign(x) 2025-05-07T20:32:07.2949330Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.2949640Z x = x_sign * x_clamp 2025-05-07T20:32:07.2949902Z x0 = x[:, :D] 2025-05-07T20:32:07.2950122Z x1 = x[:, D:] 2025-05-07T20:32:07.2950323Z 2025-05-07T20:32:07.2950508Z if contiguous: 2025-05-07T20:32:07.2950743Z x0 = x0.contiguous() 2025-05-07T20:32:07.2950995Z x1 = x1.contiguous() 2025-05-07T20:32:07.2951235Z 2025-05-07T20:32:07.2951435Z if scale_ub is not None: 2025-05-07T20:32:07.2951702Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.2952042Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.2952354Z ) 2025-05-07T20:32:07.2952547Z else: 2025-05-07T20:32:07.2952767Z scale_ub_tensor = None 2025-05-07T20:32:07.2953015Z 2025-05-07T20:32:07.2953249Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.2953553Z op = silu_mul_quant 2025-05-07T20:32:07.2953805Z if compiled: 2025-05-07T20:32:07.2954052Z op = torch.compile(op) 2025-05-07T20:32:07.2954345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.2954619Z 2025-05-07T20:32:07.2954818Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.2954984Z 2025-05-07T20:32:07.2955082Z moe/activation_test.py:117: 2025-05-07T20:32:07.2955377Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.2956007Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.2956285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.2957092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.2957797Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.2958337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.2959011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.2959673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.2960210Z kernel = self.compile( 2025-05-07T20:32:07.2960756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.2961404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.2961811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.2962041Z 2025-05-07T20:32:07.2962258Z self = 2025-05-07T20:32:07.2963332Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.2964689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc56a560>} 2025-05-07T20:32:07.2966034Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.2967054Z context = 2025-05-07T20:32:07.2967337Z 2025-05-07T20:32:07.2967515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.2968031Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.2968505Z module_map=module_map) 2025-05-07T20:32:07.2968872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.2969227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.2969487Z E ^ 2025-05-07T20:32:07.2969955Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.2970399Z 2025-05-07T20:32:07.2970823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.2971340Z 2025-05-07T20:32:07.2971456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.2971865Z self=, 2025-05-07T20:32:07.2972269Z T=128, 2025-05-07T20:32:07.2972460Z D=5120, 2025-05-07T20:32:07.2972652Z scale_ub=None, 2025-05-07T20:32:07.2972873Z contiguous=False, 2025-05-07T20:32:07.2973107Z compiled=False, 2025-05-07T20:32:07.2973307Z ) 2025-05-07T20:32:07.2973628Z self = 2025-05-07T20:32:07.2974119Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:07.2974384Z 2025-05-07T20:32:07.2974462Z @given( 2025-05-07T20:32:07.2974695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.2975011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.2975318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.2975650Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.2975982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.2976362Z ) 2025-05-07T20:32:07.2976784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.2977235Z def test_silu_mul_quant( 2025-05-07T20:32:07.2977483Z self, 2025-05-07T20:32:07.2977673Z T: int, 2025-05-07T20:32:07.2977873Z D: int, 2025-05-07T20:32:07.2978095Z scale_ub: Optional[float], 2025-05-07T20:32:07.2978360Z contiguous: bool, 2025-05-07T20:32:07.2978604Z compiled: bool, 2025-05-07T20:32:07.2978832Z ) -> None: 2025-05-07T20:32:07.2979044Z torch.manual_seed(2025) 2025-05-07T20:32:07.2979288Z 2025-05-07T20:32:07.2979563Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.2979997Z 2025-05-07T20:32:07.2980189Z x_sign = torch.sign(x) 2025-05-07T20:32:07.2980483Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.2980790Z x = x_sign * x_clamp 2025-05-07T20:32:07.2981033Z x0 = x[:, :D] 2025-05-07T20:32:07.2981248Z x1 = x[:, D:] 2025-05-07T20:32:07.2981461Z 2025-05-07T20:32:07.2981646Z if contiguous: 2025-05-07T20:32:07.2981884Z x0 = x0.contiguous() 2025-05-07T20:32:07.2982142Z x1 = x1.contiguous() 2025-05-07T20:32:07.2982378Z 2025-05-07T20:32:07.2982583Z if scale_ub is not None: 2025-05-07T20:32:07.2982907Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.2983236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.2983550Z ) 2025-05-07T20:32:07.2983746Z else: 2025-05-07T20:32:07.2983959Z scale_ub_tensor = None 2025-05-07T20:32:07.2984211Z 2025-05-07T20:32:07.2984444Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.2984754Z op = silu_mul_quant 2025-05-07T20:32:07.2985010Z if compiled: 2025-05-07T20:32:07.2985262Z op = torch.compile(op) 2025-05-07T20:32:07.2985569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.2985836Z 2025-05-07T20:32:07.2986038Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.2986205Z 2025-05-07T20:32:07.2986312Z moe/activation_test.py:117: 2025-05-07T20:32:07.2986608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.2986939Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.2987227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.2987904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.2988593Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:07.2989127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.2989802Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.2990742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.2991282Z kernel = self.compile( 2025-05-07T20:32:07.2991824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.2992476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.2992864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.2993097Z 2025-05-07T20:32:07.2993303Z self = 2025-05-07T20:32:07.2994367Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.2995722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc568550>} 2025-05-07T20:32:07.2997291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.2998328Z context = 2025-05-07T20:32:07.2998618Z 2025-05-07T20:32:07.2998783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.2999302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.2999760Z module_map=module_map) 2025-05-07T20:32:07.3000128Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.3000482Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.3000738Z E ^ 2025-05-07T20:32:07.3001204Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.3001659Z 2025-05-07T20:32:07.3002086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.3002595Z 2025-05-07T20:32:07.3002706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.3003110Z self=, 2025-05-07T20:32:07.3003507Z T=128, 2025-05-07T20:32:07.3003695Z D=5120, 2025-05-07T20:32:07.3003883Z scale_ub=1200.0, 2025-05-07T20:32:07.3004111Z contiguous=True, 2025-05-07T20:32:07.3004341Z compiled=False, 2025-05-07T20:32:07.3004546Z ) 2025-05-07T20:32:07.4946503Z self = 2025-05-07T20:32:07.4947414Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:07.4947890Z 2025-05-07T20:32:07.4948025Z @given( 2025-05-07T20:32:07.4948442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.4948961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.4949463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.4950012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.4950562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.4951061Z ) 2025-05-07T20:32:07.4951619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.4952302Z def test_silu_mul_quant( 2025-05-07T20:32:07.4952668Z self, 2025-05-07T20:32:07.4952979Z T: int, 2025-05-07T20:32:07.4953284Z D: int, 2025-05-07T20:32:07.4953623Z scale_ub: Optional[float], 2025-05-07T20:32:07.4954069Z contiguous: bool, 2025-05-07T20:32:07.4954445Z compiled: bool, 2025-05-07T20:32:07.4954805Z ) -> None: 2025-05-07T20:32:07.4955175Z torch.manual_seed(2025) 2025-05-07T20:32:07.4966365Z 2025-05-07T20:32:07.4966856Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.4967437Z 2025-05-07T20:32:07.4967772Z x_sign = torch.sign(x) 2025-05-07T20:32:07.4968255Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.4968764Z x = x_sign * x_clamp 2025-05-07T20:32:07.4969165Z x0 = x[:, :D] 2025-05-07T20:32:07.4969518Z x1 = x[:, D:] 2025-05-07T20:32:07.4969840Z 2025-05-07T20:32:07.4970142Z if contiguous: 2025-05-07T20:32:07.4970520Z x0 = x0.contiguous() 2025-05-07T20:32:07.4970943Z x1 = x1.contiguous() 2025-05-07T20:32:07.4971349Z 2025-05-07T20:32:07.4971663Z if scale_ub is not None: 2025-05-07T20:32:07.4972107Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.4972685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.4973219Z ) 2025-05-07T20:32:07.4973539Z else: 2025-05-07T20:32:07.4974330Z scale_ub_tensor = None 2025-05-07T20:32:07.4974759Z 2025-05-07T20:32:07.4975153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.4975876Z op = silu_mul_quant 2025-05-07T20:32:07.4976296Z if compiled: 2025-05-07T20:32:07.4976719Z op = torch.compile(op) 2025-05-07T20:32:07.4977216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.4977677Z 2025-05-07T20:32:07.4977992Z > y_fp8, y_scale = fn() 2025-05-07T20:32:07.4978262Z 2025-05-07T20:32:07.4978421Z moe/activation_test.py:117: 2025-05-07T20:32:07.4978915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.4979460Z moe/activation_test.py:115: in fn 2025-05-07T20:32:07.4980031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.4981182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:07.4982403Z 
_fbgemm_silu_mul_quant[grid](
[... identical compilation traceback as above: triton/runtime/jit.py:330 -> triton/runtime/jit.py:623 -> triton/compiler/compiler.py:273 -> src.make_ir(options, codegen_fns, module_map, context) ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
This example gets one step further: y_fp8, y_scale = fn() succeeds, and the failure moves into the reference path:

    y_fp8, y_scale = fn()
    y = y_fp8.to(torch.float32) * y_scale[:, None]

    def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
        return triton_quantize_fp8_row(y, scale_ub_tensor)

>   y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[... autotuner benchmarking path: triton/runtime/autotuner.py:186 -> autotuner.py:166 -> triton/testing.py:117 (do_bench) -> autotuner.py:152 -> jit.py:623 -> compiler.py:273 ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; identical CompilationError traceback in _fbgemm_silu_mul_quant]
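Every sampled configuration dies at the same point: Triton's AST-to-TTIR lowering of a kernel that uses fp8e4nv, which the NVIDIA backend on this GPU does not expose (only fp8e4b15 and fp8e5 are listed). A minimal guard sketch follows, assuming fp8e4nv only compiles on devices with compute capability 8.9 or newer; the helper and marker names are hypothetical, not part of the FBGEMM test suite.

    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (e4m3) lowers only on SM 8.9+ parts;
        # older GPUs get exactly the ValueError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker that could wrap test_silu_mul_quant:
    requires_fp8e4nv = pytest.mark.skipif(
        not _supports_fp8e4nv(),
        reason="Triton reports fp8e4nv unsupported; only fp8e4b15/fp8e5 available",
    )

With such a guard, the hypothesis sweep would be skipped once instead of re-raising the same CompilationError for every example below.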
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; identical CompilationError traceback as above]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test body; identical CompilationError traceback as above]
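The failure is clearly independent of T, D, scale_ub, contiguity, and torch.compile, since the kernel never gets past compilation. A standalone repro sketch, assuming only stock Triton and PyTorch (this kernel is hypothetical, not FBGEMM's): casting to tl.float8e4nv inside any @triton.jit kernel should raise the same ValueError at compile time on a pre-SM-8.9 device.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is what trips src.make_ir() on unsupported architectures.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8_kernel[(1,)](x, y, 128, BLOCK=128)  # expect CompilationError on SM < 8.9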
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.3552143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.3553348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.3554514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.3555457Z kernel = self.compile( 2025-05-07T20:32:08.3556402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.3557577Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3558258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3558655Z 2025-05-07T20:32:08.3559014Z self = 2025-05-07T20:32:08.3561055Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.3563577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe945e0>} 2025-05-07T20:32:08.3566018Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.3567997Z context = 2025-05-07T20:32:08.3568513Z 2025-05-07T20:32:08.3568789Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.3569705Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3570523Z module_map=module_map) 2025-05-07T20:32:08.3571142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3571730Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.3572165Z E ^ 2025-05-07T20:32:08.3572979Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3573790Z 2025-05-07T20:32:08.3574531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.3575476Z 2025-05-07T20:32:08.3575644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.3576360Z self=, 2025-05-07T20:32:08.3577050Z T=4096, 2025-05-07T20:32:08.3577347Z D=7168, 2025-05-07T20:32:08.3577660Z scale_ub=1200.0, 2025-05-07T20:32:08.3578028Z contiguous=False, 2025-05-07T20:32:08.3578425Z compiled=False, 2025-05-07T20:32:08.3578753Z ) 2025-05-07T20:32:08.3579288Z self = 2025-05-07T20:32:08.3590744Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:08.3591185Z 2025-05-07T20:32:08.3591295Z @given( 2025-05-07T20:32:08.3591612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.3592067Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.3592533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.3593029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.3593498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.3593906Z ) 2025-05-07T20:32:08.3594422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.3595078Z def test_silu_mul_quant( 2025-05-07T20:32:08.3595420Z self, 2025-05-07T20:32:08.3595701Z T: int, 2025-05-07T20:32:08.3595983Z D: int, 2025-05-07T20:32:08.3596298Z scale_ub: Optional[float], 2025-05-07T20:32:08.3596707Z contiguous: bool, 2025-05-07T20:32:08.3597070Z compiled: bool, 2025-05-07T20:32:08.3597388Z ) -> None: 2025-05-07T20:32:08.3597693Z torch.manual_seed(2025) 2025-05-07T20:32:08.3598042Z 2025-05-07T20:32:08.3598427Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.3598929Z 2025-05-07T20:32:08.3599210Z x_sign = torch.sign(x) 2025-05-07T20:32:08.3599647Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.3600148Z x = x_sign * x_clamp 2025-05-07T20:32:08.3600526Z x0 = x[:, :D] 2025-05-07T20:32:08.3600864Z x1 = x[:, D:] 2025-05-07T20:32:08.3601180Z 2025-05-07T20:32:08.3601471Z if contiguous: 2025-05-07T20:32:08.3601817Z x0 = x0.contiguous() 2025-05-07T20:32:08.3602225Z x1 = x1.contiguous() 2025-05-07T20:32:08.3602793Z 2025-05-07T20:32:08.3603106Z if scale_ub is not None: 2025-05-07T20:32:08.3603554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.3604117Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.3604645Z ) 2025-05-07T20:32:08.3604952Z else: 2025-05-07T20:32:08.3605295Z scale_ub_tensor = None 2025-05-07T20:32:08.3605709Z 2025-05-07T20:32:08.3606062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3606576Z op = silu_mul_quant 2025-05-07T20:32:08.3607122Z if compiled: 2025-05-07T20:32:08.3607512Z op = torch.compile(op) 2025-05-07T20:32:08.3608158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.3608605Z 2025-05-07T20:32:08.3608903Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.3609165Z 2025-05-07T20:32:08.3609320Z moe/activation_test.py:117: 2025-05-07T20:32:08.3609807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3610379Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.3610845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.3612097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:08.3613347Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.3614294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.3615522Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.3616708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.3617665Z     kernel = self.compile(
2025-05-07T20:32:08.3618612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.3619906Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.3620598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.3620998Z 
2025-05-07T20:32:08.3621356Z self =
2025-05-07T20:32:08.3623346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.3625871Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe94ca0>}
2025-05-07T20:32:08.3628320Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.3630169Z context =
2025-05-07T20:32:08.3630679Z 
2025-05-07T20:32:08.3630969Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.3631867Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.3632689Z                            module_map=module_map)
2025-05-07T20:32:08.3633305Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.3633895Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.3634344Z E   ^
2025-05-07T20:32:08.3635154Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.3635964Z 
2025-05-07T20:32:08.3636712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.3637646Z 
2025-05-07T20:32:08.3637912Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.3638622Z     self=,
2025-05-07T20:32:08.3639306Z     T=16384,
2025-05-07T20:32:08.3639615Z     D=7168,
2025-05-07T20:32:08.3639926Z     scale_ub=None,
2025-05-07T20:32:08.3640266Z     contiguous=True,
2025-05-07T20:32:08.3640627Z     compiled=True,
2025-05-07T20:32:08.3640960Z )
2025-05-07T20:32:08.5595046Z self =
2025-05-07T20:32:08.5595990Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:08.5596788Z 
2025-05-07T20:32:08.5596916Z     @given(
2025-05-07T20:32:08.5597498Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:08.5597999Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:08.5598478Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:08.5598955Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:08.5599498Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:08.5599982Z     )
2025-05-07T20:32:08.5600581Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:08.5601340Z     def test_silu_mul_quant(
2025-05-07T20:32:08.5601746Z         self,
2025-05-07T20:32:08.5602070Z         T: int,
2025-05-07T20:32:08.5602390Z         D: int,
2025-05-07T20:32:08.5602740Z         scale_ub: Optional[float],
2025-05-07T20:32:08.5603193Z         contiguous: bool,
2025-05-07T20:32:08.5603590Z         compiled: bool,
2025-05-07T20:32:08.5603956Z     ) -> None:
2025-05-07T20:32:08.5604309Z         torch.manual_seed(2025)
2025-05-07T20:32:08.5604712Z 
2025-05-07T20:32:08.5605165Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:08.5605745Z 
2025-05-07T20:32:08.5606058Z         x_sign = torch.sign(x)
2025-05-07T20:32:08.5606537Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:08.5607069Z         x = x_sign * x_clamp
2025-05-07T20:32:08.5607468Z         x0 = x[:, :D]
2025-05-07T20:32:08.5607816Z         x1 = x[:, D:]
2025-05-07T20:32:08.5608154Z 
2025-05-07T20:32:08.5608453Z         if contiguous:
2025-05-07T20:32:08.5608827Z             x0 = x0.contiguous()
2025-05-07T20:32:08.5609256Z             x1 = x1.contiguous()
2025-05-07T20:32:08.5609655Z 
2025-05-07T20:32:08.5609959Z         if scale_ub is not None:
2025-05-07T20:32:08.5610416Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:08.5610977Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:08.5611500Z             )
2025-05-07T20:32:08.5611802Z         else:
2025-05-07T20:32:08.5612151Z             scale_ub_tensor = None
2025-05-07T20:32:08.5612580Z 
2025-05-07T20:32:08.5612950Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.5613481Z             op = silu_mul_quant
2025-05-07T20:32:08.5613892Z             if compiled:
2025-05-07T20:32:08.5614294Z                 op = torch.compile(op)
2025-05-07T20:32:08.5614795Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.5615261Z 
2025-05-07T20:32:08.5615565Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.5615854Z 
2025-05-07T20:32:08.5616014Z moe/activation_test.py:117: 
2025-05-07T20:32:08.5616510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.5617071Z moe/activation_test.py:115: in fn
2025-05-07T20:32:08.5617533Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.5618511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:08.5619503Z     return fn(*args, **kwargs)
2025-05-07T20:32:08.5620756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:08.5621960Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.5622850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.5624197Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.5625326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.5626236Z     kernel = self.compile(
2025-05-07T20:32:08.5627192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.5628351Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.5629120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.5629634Z 
2025-05-07T20:32:08.5629984Z self =
2025-05-07T20:32:08.5631927Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.5634487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe95b40>}
2025-05-07T20:32:08.5636928Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.5638787Z context =
2025-05-07T20:32:08.5639288Z 
2025-05-07T20:32:08.5639566Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.5640437Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.5641240Z                            module_map=module_map)
2025-05-07T20:32:08.5641866Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.5642463Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.5642888Z E   ^
2025-05-07T20:32:08.5643697Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.5644508Z 
2025-05-07T20:32:08.5645260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
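The failure is architectural, not a logic bug in the test: silu_mul_quant quantizes its output to Triton's fp8e4nv (the e4m3 FP8 format), and Triton refuses to lower that type for this runner's GPU, which only exposes fp8e4b15 and fp8e5. Compilation therefore dies in make_ir before the kernel ever launches, once per Hypothesis example. NVIDIA hardware gained e4m3 support at compute capability 8.9 (Ada) and 9.0 (Hopper), so a capability gate would skip these cases cleanly on older parts. A minimal sketch follows; the helper and the unittest-style class name are illustrative, not code from moe/activation_test.py.

# Hypothetical capability gate, a sketch rather than FBGEMM code.
# Assumes Triton's fp8e4nv requires an e4m3-capable GPU (SM 8.9 or newer).
import unittest

import torch


def _cuda_supports_fp8e4nv() -> bool:
    # get_device_capability() returns (major, minor); tuple comparison
    # expresses "SM 8.9 or newer". Older parts hit the CompilationError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class ActivationTests(unittest.TestCase):  # class name assumed for illustration
    ...

With such a gate the suite would report one skip with a clear reason instead of compiling and failing every sampled example against the same compiler error.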
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5644508Z 2025-05-07T20:32:08.5645260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5646180Z 2025-05-07T20:32:08.5646356Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.5647064Z self=, 2025-05-07T20:32:08.5647758Z T=4096, 2025-05-07T20:32:08.5648073Z D=5120, 2025-05-07T20:32:08.5648377Z scale_ub=None, 2025-05-07T20:32:08.5648728Z contiguous=False, 2025-05-07T20:32:08.5649100Z compiled=True, 2025-05-07T20:32:08.5649431Z ) 2025-05-07T20:32:08.5649970Z self = 2025-05-07T20:32:08.5650830Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.5651298Z 2025-05-07T20:32:08.5651431Z @given( 2025-05-07T20:32:08.5651803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5652330Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5652850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5653401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5653965Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5654451Z ) 2025-05-07T20:32:08.5655049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5655823Z def test_silu_mul_quant( 2025-05-07T20:32:08.5656227Z self, 2025-05-07T20:32:08.5656533Z T: int, 2025-05-07T20:32:08.5656855Z D: int, 2025-05-07T20:32:08.5657213Z scale_ub: Optional[float], 2025-05-07T20:32:08.5657755Z contiguous: bool, 2025-05-07T20:32:08.5658154Z compiled: bool, 2025-05-07T20:32:08.5658522Z ) -> None: 2025-05-07T20:32:08.5658872Z torch.manual_seed(2025) 2025-05-07T20:32:08.5659254Z 2025-05-07T20:32:08.5659677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5660255Z 2025-05-07T20:32:08.5660498Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5660892Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5661324Z x = x_sign * x_clamp 2025-05-07T20:32:08.5661737Z x0 = x[:, :D] 2025-05-07T20:32:08.5662024Z x1 = x[:, D:] 2025-05-07T20:32:08.5662300Z 2025-05-07T20:32:08.5662692Z if contiguous: 2025-05-07T20:32:08.5663044Z x0 = x0.contiguous() 2025-05-07T20:32:08.5663433Z x1 = x1.contiguous() 2025-05-07T20:32:08.5663759Z 2025-05-07T20:32:08.5664023Z if scale_ub is not None: 2025-05-07T20:32:08.5664420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5664905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5665364Z ) 2025-05-07T20:32:08.5665643Z else: 2025-05-07T20:32:08.5665942Z scale_ub_tensor = None 2025-05-07T20:32:08.5666293Z 2025-05-07T20:32:08.5666635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5667106Z op = silu_mul_quant 2025-05-07T20:32:08.5667466Z if compiled: 2025-05-07T20:32:08.5667814Z op = torch.compile(op) 2025-05-07T20:32:08.5668235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5668623Z 2025-05-07T20:32:08.5668902Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5669166Z 2025-05-07T20:32:08.5669311Z moe/activation_test.py:117: 2025-05-07T20:32:08.5669746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5670216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5670648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5671586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.5672528Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.5673650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.5674825Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.5675715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.5676872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.5677999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.5678896Z kernel = self.compile( 2025-05-07T20:32:08.5679791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.5680829Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5681457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5681822Z 2025-05-07T20:32:08.5682162Z self = 2025-05-07T20:32:08.5683978Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.5686429Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe95240>} 2025-05-07T20:32:08.5688791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.5691093Z context = 2025-05-07T20:32:08.5691588Z 2025-05-07T20:32:08.5691864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.5692758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5693601Z module_map=module_map) 2025-05-07T20:32:08.5694189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5694932Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5695376Z E ^ 2025-05-07T20:32:08.5696358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5697171Z 2025-05-07T20:32:08.5697920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5698841Z 2025-05-07T20:32:08.9227839Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.9228573Z self=, 2025-05-07T20:32:08.9229064Z T=4096, 2025-05-07T20:32:08.9229265Z D=5120, 2025-05-07T20:32:08.9229469Z scale_ub=1200.0, 2025-05-07T20:32:08.9229702Z contiguous=False, 2025-05-07T20:32:08.9229938Z compiled=False, 2025-05-07T20:32:08.9230153Z ) 2025-05-07T20:32:08.9230483Z self = 2025-05-07T20:32:08.9231014Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:08.9231293Z 2025-05-07T20:32:08.9231391Z @given( 2025-05-07T20:32:08.9231625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.9231945Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.9232255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.9232590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.9232916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.9233207Z ) 2025-05-07T20:32:08.9233566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.9234002Z def test_silu_mul_quant( 2025-05-07T20:32:08.9234247Z self, 2025-05-07T20:32:08.9234445Z T: int, 2025-05-07T20:32:08.9234641Z D: int, 2025-05-07T20:32:08.9234869Z scale_ub: Optional[float], 2025-05-07T20:32:08.9235141Z contiguous: bool, 2025-05-07T20:32:08.9235385Z compiled: bool, 2025-05-07T20:32:08.9235616Z ) -> None: 2025-05-07T20:32:08.9235841Z torch.manual_seed(2025) 2025-05-07T20:32:08.9236084Z 2025-05-07T20:32:08.9236360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.9236707Z 2025-05-07T20:32:08.9236907Z x_sign = torch.sign(x) 2025-05-07T20:32:08.9237195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.9237508Z x = x_sign * x_clamp 2025-05-07T20:32:08.9237752Z x0 = x[:, :D] 2025-05-07T20:32:08.9237967Z x1 = x[:, D:] 2025-05-07T20:32:08.9238177Z 2025-05-07T20:32:08.9238363Z if contiguous: 2025-05-07T20:32:08.9238594Z x0 = x0.contiguous() 2025-05-07T20:32:08.9238859Z x1 = x1.contiguous() 2025-05-07T20:32:08.9239105Z 2025-05-07T20:32:08.9239299Z if scale_ub is not None: 2025-05-07T20:32:08.9239580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.9239931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.9240238Z ) 2025-05-07T20:32:08.9240446Z else: 2025-05-07T20:32:08.9240664Z scale_ub_tensor = None 2025-05-07T20:32:08.9240916Z 2025-05-07T20:32:08.9241155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.9241472Z op = silu_mul_quant 2025-05-07T20:32:08.9242037Z if compiled: 2025-05-07T20:32:08.9242285Z op = torch.compile(op) 2025-05-07T20:32:08.9242587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9242866Z 2025-05-07T20:32:08.9243056Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.9243230Z 2025-05-07T20:32:08.9243333Z moe/activation_test.py:117: 2025-05-07T20:32:08.9243633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9243961Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.9244242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9245167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:08.9245861Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.9246393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.9247075Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.9247739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.9248263Z kernel = self.compile( 2025-05-07T20:32:08.9248805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.9249463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9249859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9250088Z 2025-05-07T20:32:08.9250295Z self = 2025-05-07T20:32:08.9251374Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.9252769Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe96cb0>} 2025-05-07T20:32:08.9254104Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.9255121Z context = 2025-05-07T20:32:08.9255407Z 2025-05-07T20:32:08.9255576Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.9256100Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9256567Z module_map=module_map) 2025-05-07T20:32:08.9256929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9257283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9257550Z E ^ 2025-05-07T20:32:08.9258011Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9258462Z 2025-05-07T20:32:08.9258886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.9259399Z 2025-05-07T20:32:08.9259504Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.9260001Z self=, 2025-05-07T20:32:08.9260406Z T=4096, 2025-05-07T20:32:08.9260601Z D=5120, 2025-05-07T20:32:08.9260801Z scale_ub=1200.0, 2025-05-07T20:32:08.9261030Z contiguous=False, 2025-05-07T20:32:08.9261268Z compiled=True, 2025-05-07T20:32:08.9261477Z ) 2025-05-07T20:32:08.9261801Z self = 2025-05-07T20:32:08.9262294Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:08.9262632Z 2025-05-07T20:32:08.9262712Z @given( 2025-05-07T20:32:08.9262970Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.9263334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.9263645Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.9263968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.9264303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.9264595Z ) 2025-05-07T20:32:08.9275016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.9275603Z def test_silu_mul_quant( 2025-05-07T20:32:08.9275859Z self, 2025-05-07T20:32:08.9276148Z T: int, 2025-05-07T20:32:08.9276353Z D: int, 2025-05-07T20:32:08.9276582Z scale_ub: Optional[float], 2025-05-07T20:32:08.9276861Z contiguous: bool, 2025-05-07T20:32:08.9277108Z compiled: bool, 2025-05-07T20:32:08.9277350Z ) -> None: 2025-05-07T20:32:08.9277578Z torch.manual_seed(2025) 2025-05-07T20:32:08.9277823Z 2025-05-07T20:32:08.9278111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.9278462Z 2025-05-07T20:32:08.9278661Z x_sign = torch.sign(x) 2025-05-07T20:32:08.9278964Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.9279281Z x = x_sign * x_clamp 2025-05-07T20:32:08.9279529Z x0 = x[:, :D] 2025-05-07T20:32:08.9279755Z x1 = x[:, D:] 2025-05-07T20:32:08.9279994Z 2025-05-07T20:32:08.9280195Z if contiguous: 2025-05-07T20:32:08.9280428Z x0 = x0.contiguous() 2025-05-07T20:32:08.9280708Z x1 = x1.contiguous() 2025-05-07T20:32:08.9280958Z 2025-05-07T20:32:08.9281154Z if scale_ub is not None: 2025-05-07T20:32:08.9281435Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.9281775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.9282090Z ) 2025-05-07T20:32:08.9282293Z else: 2025-05-07T20:32:08.9282516Z scale_ub_tensor = None 2025-05-07T20:32:08.9282779Z 2025-05-07T20:32:08.9283014Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.9283336Z op = silu_mul_quant 2025-05-07T20:32:08.9283593Z if compiled: 2025-05-07T20:32:08.9283841Z op = torch.compile(op) 2025-05-07T20:32:08.9284148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9284430Z 2025-05-07T20:32:08.9284624Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.9284795Z 2025-05-07T20:32:08.9284898Z moe/activation_test.py:117: 2025-05-07T20:32:08.9285207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9285541Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.9285830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.9286400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.9286965Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.9287618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.9288312Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.9288852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.9289527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.9290535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.9291076Z kernel = self.compile( 2025-05-07T20:32:08.9291619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.9292264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9292790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.9293022Z 2025-05-07T20:32:08.9293273Z self = 2025-05-07T20:32:08.9294351Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.9295709Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe96b90>} 2025-05-07T20:32:08.9297228Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.9298264Z context = 2025-05-07T20:32:08.9298550Z 2025-05-07T20:32:08.9298728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.9299240Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9299705Z module_map=module_map) 2025-05-07T20:32:08.9300189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9300552Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9300808Z E ^ 2025-05-07T20:32:08.9301277Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9301723Z 2025-05-07T20:32:08.9302150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.9302653Z 2025-05-07T20:32:09.0593713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0594445Z self=, 2025-05-07T20:32:09.0594993Z T=2048, 2025-05-07T20:32:09.0595188Z D=7168, 2025-05-07T20:32:09.0595393Z scale_ub=1200.0, 2025-05-07T20:32:09.0595632Z contiguous=False, 2025-05-07T20:32:09.0595863Z compiled=False, 2025-05-07T20:32:09.0596081Z ) 2025-05-07T20:32:09.0596415Z self = 2025-05-07T20:32:09.0596916Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.0597213Z 2025-05-07T20:32:09.0597295Z @given( 2025-05-07T20:32:09.0597540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0597868Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0598188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0598530Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0598868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0599153Z ) 2025-05-07T20:32:09.0599514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0599962Z def test_silu_mul_quant( 2025-05-07T20:32:09.0600211Z self, 2025-05-07T20:32:09.0600419Z T: int, 2025-05-07T20:32:09.0600626Z D: int, 2025-05-07T20:32:09.0600847Z scale_ub: Optional[float], 2025-05-07T20:32:09.0601130Z contiguous: bool, 2025-05-07T20:32:09.0601387Z compiled: bool, 2025-05-07T20:32:09.0601615Z ) -> None: 2025-05-07T20:32:09.0601846Z torch.manual_seed(2025) 2025-05-07T20:32:09.0602091Z 2025-05-07T20:32:09.0602368Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0602720Z 2025-05-07T20:32:09.0602919Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0603210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0603523Z x = x_sign * x_clamp 2025-05-07T20:32:09.0604046Z x0 = x[:, :D] 2025-05-07T20:32:09.0604269Z x1 = x[:, D:] 2025-05-07T20:32:09.0604477Z 2025-05-07T20:32:09.0604673Z if contiguous: 2025-05-07T20:32:09.0604913Z x0 = x0.contiguous() 2025-05-07T20:32:09.0605172Z x1 = x1.contiguous() 2025-05-07T20:32:09.0605421Z 2025-05-07T20:32:09.0605622Z if scale_ub is not None: 2025-05-07T20:32:09.0605894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0606235Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0606548Z ) 2025-05-07T20:32:09.0606833Z else: 2025-05-07T20:32:09.0607052Z scale_ub_tensor = None 2025-05-07T20:32:09.0607316Z 2025-05-07T20:32:09.0607690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0608011Z op = silu_mul_quant 2025-05-07T20:32:09.0608270Z if compiled: 2025-05-07T20:32:09.0608520Z op = torch.compile(op) 2025-05-07T20:32:09.0608828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0609110Z 2025-05-07T20:32:09.0609304Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0609481Z 2025-05-07T20:32:09.0609584Z moe/activation_test.py:117: 2025-05-07T20:32:09.0609887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0610224Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0610507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0611205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:09.0611903Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0612452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0613144Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0613817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0614358Z kernel = self.compile( 2025-05-07T20:32:09.0614900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0615559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0615959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0616188Z 2025-05-07T20:32:09.0616406Z self = 2025-05-07T20:32:09.0617489Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0618880Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5dc5e0>} 2025-05-07T20:32:09.0620308Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0621333Z context = 2025-05-07T20:32:09.0621623Z 2025-05-07T20:32:09.0621802Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0622328Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0622806Z module_map=module_map) 2025-05-07T20:32:09.0623211Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0623588Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0623852Z E ^ 2025-05-07T20:32:09.0624320Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0624831Z 2025-05-07T20:32:09.0625256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0625766Z 2025-05-07T20:32:09.0625871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0626294Z self=, 2025-05-07T20:32:09.0626700Z T=1, 2025-05-07T20:32:09.0626885Z D=7168, 2025-05-07T20:32:09.0627082Z scale_ub=None, 2025-05-07T20:32:09.0627352Z contiguous=True, 2025-05-07T20:32:09.0627577Z compiled=False, 2025-05-07T20:32:09.0627786Z ) 2025-05-07T20:32:09.0628189Z self = 2025-05-07T20:32:09.0628681Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:09.0628951Z 2025-05-07T20:32:09.0629035Z @given( 2025-05-07T20:32:09.0629273Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0629594Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0629903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0630237Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0630566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0630851Z ) 2025-05-07T20:32:09.0631209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0631658Z def test_silu_mul_quant( 2025-05-07T20:32:09.0631903Z self, 2025-05-07T20:32:09.0632110Z T: int, 2025-05-07T20:32:09.0632311Z D: int, 2025-05-07T20:32:09.0632535Z scale_ub: Optional[float], 2025-05-07T20:32:09.0632815Z contiguous: bool, 2025-05-07T20:32:09.0633066Z compiled: bool, 2025-05-07T20:32:09.0633298Z ) -> None: 2025-05-07T20:32:09.0633517Z torch.manual_seed(2025) 2025-05-07T20:32:09.0633768Z 2025-05-07T20:32:09.0634053Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0634396Z 2025-05-07T20:32:09.0634604Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0634908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0635218Z x = x_sign * x_clamp 2025-05-07T20:32:09.0635472Z x0 = x[:, :D] 2025-05-07T20:32:09.0635698Z x1 = x[:, D:] 2025-05-07T20:32:09.0635906Z 2025-05-07T20:32:09.0636102Z if contiguous: 2025-05-07T20:32:09.0636342Z x0 = x0.contiguous() 2025-05-07T20:32:09.0636608Z x1 = x1.contiguous() 2025-05-07T20:32:09.0636858Z 2025-05-07T20:32:09.0637062Z if scale_ub is not None: 2025-05-07T20:32:09.0637340Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0637684Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0637994Z ) 2025-05-07T20:32:09.0638196Z else: 2025-05-07T20:32:09.0638409Z scale_ub_tensor = None 2025-05-07T20:32:09.0638675Z 2025-05-07T20:32:09.0638915Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0639230Z op = silu_mul_quant 2025-05-07T20:32:09.0639491Z if compiled: 2025-05-07T20:32:09.0639748Z op = torch.compile(op) 2025-05-07T20:32:09.0640046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0640328Z 2025-05-07T20:32:09.0640534Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0640702Z 2025-05-07T20:32:09.0640802Z moe/activation_test.py:117: 2025-05-07T20:32:09.0641107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0641445Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0641736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0642429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0643125Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0643722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0644410Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0645079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0645615Z kernel = self.compile( 2025-05-07T20:32:09.0646162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0646870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0647352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0647582Z 2025-05-07T20:32:09.0647802Z self = 2025-05-07T20:32:09.0648886Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0650244Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5dcd30>} 2025-05-07T20:32:09.0651587Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0652618Z context = 2025-05-07T20:32:09.0652913Z 2025-05-07T20:32:09.0653108Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0653672Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0654150Z module_map=module_map) 2025-05-07T20:32:09.0654528Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0654891Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0655154Z E ^ 2025-05-07T20:32:09.0655627Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0656073Z 2025-05-07T20:32:09.0656497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0657011Z 2025-05-07T20:32:09.0657118Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0657544Z self=, 2025-05-07T20:32:09.0657958Z T=16384, 2025-05-07T20:32:09.0658158Z D=7168, 2025-05-07T20:32:09.0658363Z scale_ub=1200.0, 2025-05-07T20:32:09.0658599Z contiguous=False, 2025-05-07T20:32:09.0658826Z compiled=True, 2025-05-07T20:32:09.3328847Z ) 2025-05-07T20:32:09.3329418Z self = 2025-05-07T20:32:09.3330126Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.3330441Z 2025-05-07T20:32:09.3330521Z @given( 2025-05-07T20:32:09.3330764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3331076Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3331376Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3331713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3332066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3332350Z ) 2025-05-07T20:32:09.3332713Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3333158Z def test_silu_mul_quant( 2025-05-07T20:32:09.3333396Z self, 2025-05-07T20:32:09.3333598Z T: int, 2025-05-07T20:32:09.3334058Z D: int, 2025-05-07T20:32:09.3334281Z scale_ub: Optional[float], 2025-05-07T20:32:09.3334548Z contiguous: bool, 2025-05-07T20:32:09.3334794Z compiled: bool, 2025-05-07T20:32:09.3335033Z ) -> None: 2025-05-07T20:32:09.3335244Z torch.manual_seed(2025) 2025-05-07T20:32:09.3335485Z 2025-05-07T20:32:09.3335761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3336097Z 2025-05-07T20:32:09.3336292Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3336587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3336993Z x = x_sign * x_clamp 2025-05-07T20:32:09.3337237Z x0 = x[:, :D] 2025-05-07T20:32:09.3337593Z x1 = x[:, D:] 2025-05-07T20:32:09.3337803Z 2025-05-07T20:32:09.3337993Z if contiguous: 2025-05-07T20:32:09.3338228Z x0 = x0.contiguous() 2025-05-07T20:32:09.3338480Z x1 = x1.contiguous() 2025-05-07T20:32:09.3338720Z 2025-05-07T20:32:09.3338920Z if scale_ub is not None: 2025-05-07T20:32:09.3339193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3339528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3339950Z ) 2025-05-07T20:32:09.3340148Z else: 2025-05-07T20:32:09.3340358Z scale_ub_tensor = None 2025-05-07T20:32:09.3340611Z 2025-05-07T20:32:09.3340845Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3341151Z op = silu_mul_quant 2025-05-07T20:32:09.3341405Z if compiled: 2025-05-07T20:32:09.3341661Z op = torch.compile(op) 2025-05-07T20:32:09.3341952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3342234Z 2025-05-07T20:32:09.3342434Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.3342598Z 2025-05-07T20:32:09.3342698Z moe/activation_test.py:117: 2025-05-07T20:32:09.3343000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3343337Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.3343621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3344173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.3344731Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.3345388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.3346072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.3346612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3347295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3347961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3348487Z kernel = self.compile( 2025-05-07T20:32:09.3349032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3349683Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3350072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3350303Z 2025-05-07T20:32:09.3350509Z self = 2025-05-07T20:32:09.3351586Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3352961Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5ddbd0>} 2025-05-07T20:32:09.3354289Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3355367Z context = 2025-05-07T20:32:09.3355659Z 2025-05-07T20:32:09.3355824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3356340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3356804Z module_map=module_map) 2025-05-07T20:32:09.3357205Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3357630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3357890Z E ^ 2025-05-07T20:32:09.3358373Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3358813Z 2025-05-07T20:32:09.3359225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.3359738Z 2025-05-07T20:32:09.3359847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3360268Z self=, 2025-05-07T20:32:09.3360664Z T=1, 2025-05-07T20:32:09.3360844Z D=7168, 2025-05-07T20:32:09.3361037Z scale_ub=None, 2025-05-07T20:32:09.3361254Z contiguous=False, 2025-05-07T20:32:09.3361484Z compiled=False, 2025-05-07T20:32:09.3361692Z ) 2025-05-07T20:32:09.3362016Z self = 2025-05-07T20:32:09.3362501Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:09.3362766Z 2025-05-07T20:32:09.3362843Z @given( 2025-05-07T20:32:09.3363076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3363389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3363694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3364022Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3364350Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3364631Z ) 2025-05-07T20:32:09.3364984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3365422Z def test_silu_mul_quant( 2025-05-07T20:32:09.3365658Z self, 2025-05-07T20:32:09.3365856Z T: int, 2025-05-07T20:32:09.3366056Z D: int, 2025-05-07T20:32:09.3366277Z scale_ub: Optional[float], 2025-05-07T20:32:09.3366552Z contiguous: bool, 2025-05-07T20:32:09.3366798Z compiled: bool, 2025-05-07T20:32:09.3367029Z ) -> None: 2025-05-07T20:32:09.3367248Z torch.manual_seed(2025) 2025-05-07T20:32:09.3367492Z 2025-05-07T20:32:09.3367761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3368100Z 2025-05-07T20:32:09.3368301Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3368600Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3368906Z x = x_sign * x_clamp 2025-05-07T20:32:09.3369151Z x0 = x[:, :D] 2025-05-07T20:32:09.3369376Z x1 = x[:, D:] 2025-05-07T20:32:09.3369579Z 2025-05-07T20:32:09.3369767Z if contiguous: 2025-05-07T20:32:09.3370003Z x0 = x0.contiguous() 2025-05-07T20:32:09.3370255Z x1 = x1.contiguous() 2025-05-07T20:32:09.3370520Z 2025-05-07T20:32:09.3370723Z if scale_ub is not None: 2025-05-07T20:32:09.3371002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3371333Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3371644Z ) 2025-05-07T20:32:09.3371847Z else: 2025-05-07T20:32:09.3372057Z scale_ub_tensor = None 2025-05-07T20:32:09.3372315Z 2025-05-07T20:32:09.3372552Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3372913Z op = silu_mul_quant 2025-05-07T20:32:09.3373190Z if compiled: 2025-05-07T20:32:09.3373482Z op = torch.compile(op) 2025-05-07T20:32:09.3373777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3374054Z 2025-05-07T20:32:09.3374251Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.3374417Z 2025-05-07T20:32:09.3374521Z moe/activation_test.py:117: 2025-05-07T20:32:09.3374811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3375146Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.3375482Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3376234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.3386543Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.3387165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3387872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3388542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3389084Z kernel = self.compile( 2025-05-07T20:32:09.3389632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3390684Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3391100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3391330Z 2025-05-07T20:32:09.3391561Z self = 2025-05-07T20:32:09.3392632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3394054Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5de050>} 2025-05-07T20:32:09.3395389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3396408Z context = 2025-05-07T20:32:09.3396698Z 2025-05-07T20:32:09.3396872Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3397414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3397890Z module_map=module_map) 2025-05-07T20:32:09.3398264Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3398620Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3398892Z E ^ 2025-05-07T20:32:09.3399366Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3399811Z 2025-05-07T20:32:09.3400227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.3400743Z 2025-05-07T20:32:09.3400853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3401278Z self=, 2025-05-07T20:32:09.3401692Z T=2048, 2025-05-07T20:32:09.3401888Z D=7168, 2025-05-07T20:32:09.3402095Z scale_ub=None, 2025-05-07T20:32:09.3402321Z contiguous=False, 2025-05-07T20:32:09.3402550Z compiled=True, 2025-05-07T20:32:09.3402766Z ) 2025-05-07T20:32:09.4404626Z self = 2025-05-07T20:32:09.4405478Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:09.4405752Z 2025-05-07T20:32:09.4405842Z @given( 2025-05-07T20:32:09.4406075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.4406392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.4406705Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.4407030Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.4407361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.4407756Z ) 2025-05-07T20:32:09.4408102Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.4408685Z def test_silu_mul_quant( 2025-05-07T20:32:09.4408935Z self, 2025-05-07T20:32:09.4409130Z T: int, 2025-05-07T20:32:09.4409327Z D: int, 2025-05-07T20:32:09.4409548Z scale_ub: Optional[float], 2025-05-07T20:32:09.4409817Z contiguous: bool, 2025-05-07T20:32:09.4410066Z compiled: bool, 2025-05-07T20:32:09.4410295Z ) -> None: 2025-05-07T20:32:09.4410512Z torch.manual_seed(2025) 2025-05-07T20:32:09.4410759Z 2025-05-07T20:32:09.4411040Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.4411391Z 2025-05-07T20:32:09.4411584Z x_sign = torch.sign(x) 2025-05-07T20:32:09.4411883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.4412203Z x = x_sign * x_clamp 2025-05-07T20:32:09.4412442Z x0 = x[:, :D] 2025-05-07T20:32:09.4412672Z x1 = x[:, D:] 2025-05-07T20:32:09.4412883Z 2025-05-07T20:32:09.4413069Z if contiguous: 2025-05-07T20:32:09.4413315Z x0 = x0.contiguous() 2025-05-07T20:32:09.4413585Z x1 = x1.contiguous() 2025-05-07T20:32:09.4413825Z 2025-05-07T20:32:09.4414026Z if scale_ub is not None: 2025-05-07T20:32:09.4414305Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.4414638Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.4414954Z ) 2025-05-07T20:32:09.4415156Z else: 2025-05-07T20:32:09.4415371Z scale_ub_tensor = None 2025-05-07T20:32:09.4415631Z 2025-05-07T20:32:09.4415867Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.4416186Z op = silu_mul_quant 2025-05-07T20:32:09.4416437Z if compiled: 2025-05-07T20:32:09.4416684Z op = torch.compile(op) 2025-05-07T20:32:09.4416980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.4417256Z 2025-05-07T20:32:09.4417448Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.4417612Z 2025-05-07T20:32:09.4417719Z moe/activation_test.py:117: 2025-05-07T20:32:09.4418015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.4418347Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.4418629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.4419193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.4419753Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.4420505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.4421203Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.4421741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.4422420Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.4423082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.4423611Z kernel = self.compile( 2025-05-07T20:32:09.4424151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.4424865Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.4425256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.4425489Z 2025-05-07T20:32:09.4425697Z self = 2025-05-07T20:32:09.4426770Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.4428276Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5df1c0>} 2025-05-07T20:32:09.4429614Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.4430634Z context = 2025-05-07T20:32:09.4430925Z 2025-05-07T20:32:09.4431100Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.4431627Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.4432085Z module_map=module_map) 2025-05-07T20:32:09.4432459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.4432824Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.4433081Z E ^ 2025-05-07T20:32:09.4433596Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.4434053Z 2025-05-07T20:32:09.4434467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.4434984Z 2025-05-07T20:32:09.4435099Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.4435507Z self=, 2025-05-07T20:32:09.4435908Z T=4096, 2025-05-07T20:32:09.4436099Z D=7168, 2025-05-07T20:32:09.4436290Z scale_ub=None, 2025-05-07T20:32:09.4436528Z contiguous=False, 2025-05-07T20:32:09.4436757Z compiled=True, 2025-05-07T20:32:09.4436962Z ) 2025-05-07T20:32:09.4437274Z self = 2025-05-07T20:32:09.4437767Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:09.4438044Z 2025-05-07T20:32:09.4438124Z @given( 2025-05-07T20:32:09.4438363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.4438673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.4438981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.4439308Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.4439633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.4439923Z ) 2025-05-07T20:32:09.4440277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.4440714Z def test_silu_mul_quant( 2025-05-07T20:32:09.4440953Z self, 2025-05-07T20:32:09.4441149Z T: int, 2025-05-07T20:32:09.4441352Z D: int, 2025-05-07T20:32:09.4441567Z scale_ub: Optional[float], 2025-05-07T20:32:09.4441841Z contiguous: bool, 2025-05-07T20:32:09.4442081Z compiled: bool, 2025-05-07T20:32:09.4442308Z ) -> None: 2025-05-07T20:32:09.4442528Z torch.manual_seed(2025) 2025-05-07T20:32:09.4442775Z 2025-05-07T20:32:09.4443050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.4443434Z 2025-05-07T20:32:09.4443643Z x_sign = torch.sign(x) 2025-05-07T20:32:09.4443931Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.4444299Z x = x_sign * x_clamp 2025-05-07T20:32:09.4444544Z x0 = x[:, :D] 2025-05-07T20:32:09.4444758Z x1 = x[:, D:] 2025-05-07T20:32:09.4444970Z 2025-05-07T20:32:09.4445158Z if contiguous: 2025-05-07T20:32:09.4445393Z x0 = x0.contiguous() 2025-05-07T20:32:09.4445658Z x1 = x1.contiguous() 2025-05-07T20:32:09.4445898Z 2025-05-07T20:32:09.4446094Z if scale_ub is not None: 2025-05-07T20:32:09.4446371Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.4446700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.4447059Z ) 2025-05-07T20:32:09.4447249Z else: 2025-05-07T20:32:09.4447538Z scale_ub_tensor = None 2025-05-07T20:32:09.4447796Z 2025-05-07T20:32:09.4448022Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.4448340Z op = silu_mul_quant 2025-05-07T20:32:09.4448597Z if compiled: 2025-05-07T20:32:09.4448846Z op = torch.compile(op) 2025-05-07T20:32:09.4449143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.4449416Z 2025-05-07T20:32:09.4449605Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.4449773Z 2025-05-07T20:32:09.4449877Z moe/activation_test.py:117: 2025-05-07T20:32:09.4450173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.4450503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.4450780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.4451335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.4451895Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.4452552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.4453240Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.4453772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.4454448Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.4455100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.4455629Z kernel = self.compile( 2025-05-07T20:32:09.4456164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.4456811Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.4457212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.4457443Z 2025-05-07T20:32:09.4457648Z self = 2025-05-07T20:32:09.4458714Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.4460167Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf301f0>} 2025-05-07T20:32:09.4461504Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.4462522Z context = 2025-05-07T20:32:09.4462808Z 2025-05-07T20:32:09.4462986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.4463510Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.4463968Z module_map=module_map) 2025-05-07T20:32:09.4464394Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.4464753Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.4465005Z E ^ 2025-05-07T20:32:09.4465473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.4465921Z 2025-05-07T20:32:09.4466346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.4466852Z 2025-05-07T20:32:09.8116554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.8117543Z self=, 2025-05-07T20:32:09.8118317Z T=16384, 2025-05-07T20:32:09.8118572Z D=5120, 2025-05-07T20:32:09.8118813Z scale_ub=1200.0, 2025-05-07T20:32:09.8119044Z contiguous=False, 2025-05-07T20:32:09.8119277Z compiled=False, 2025-05-07T20:32:09.8119488Z ) 2025-05-07T20:32:09.8119813Z self = 2025-05-07T20:32:09.8120330Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.8120613Z 2025-05-07T20:32:09.8120696Z @given( 2025-05-07T20:32:09.8120936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8121252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8121566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8121896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8122233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8122531Z ) 2025-05-07T20:32:09.8122890Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8123337Z def test_silu_mul_quant( 2025-05-07T20:32:09.8123590Z self, 2025-05-07T20:32:09.8123788Z T: int, 2025-05-07T20:32:09.8123991Z D: int, 2025-05-07T20:32:09.8124214Z scale_ub: Optional[float], 2025-05-07T20:32:09.8124488Z contiguous: bool, 2025-05-07T20:32:09.8124732Z compiled: bool, 2025-05-07T20:32:09.8124969Z ) -> None: 2025-05-07T20:32:09.8125186Z torch.manual_seed(2025) 2025-05-07T20:32:09.8125433Z 2025-05-07T20:32:09.8125714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8126064Z 2025-05-07T20:32:09.8126258Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8126556Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8126874Z x = x_sign * x_clamp 2025-05-07T20:32:09.8127122Z x0 = x[:, :D] 2025-05-07T20:32:09.8127347Z x1 = x[:, D:] 2025-05-07T20:32:09.8127563Z 2025-05-07T20:32:09.8127758Z if contiguous: 2025-05-07T20:32:09.8128003Z x0 = x0.contiguous() 2025-05-07T20:32:09.8128273Z x1 = x1.contiguous() 2025-05-07T20:32:09.8128518Z 2025-05-07T20:32:09.8128718Z if scale_ub is not None: 2025-05-07T20:32:09.8129002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8129341Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8129648Z ) 2025-05-07T20:32:09.8129845Z else: 2025-05-07T20:32:09.8130061Z scale_ub_tensor = None 2025-05-07T20:32:09.8130321Z 2025-05-07T20:32:09.8130557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8130876Z op = silu_mul_quant 2025-05-07T20:32:09.8131133Z if compiled: 2025-05-07T20:32:09.8131385Z op = torch.compile(op) 2025-05-07T20:32:09.8131691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8131972Z 2025-05-07T20:32:09.8132182Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.8132347Z 2025-05-07T20:32:09.8132450Z moe/activation_test.py:117: 2025-05-07T20:32:09.8132752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8133092Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.8133460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8134151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:09.8134842Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.8135375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.8136059Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.8136769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.8137385Z kernel = self.compile( 2025-05-07T20:32:09.8137929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.8138585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.8138985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8139214Z 2025-05-07T20:32:09.8139428Z self = 2025-05-07T20:32:09.8140586Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.8141978Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf30700>} 2025-05-07T20:32:09.8143321Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.8144344Z context = 2025-05-07T20:32:09.8144635Z 2025-05-07T20:32:09.8144807Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.8145320Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.8145787Z module_map=module_map) 2025-05-07T20:32:09.8146156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.8146504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.8146770Z E ^ 2025-05-07T20:32:09.8147235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8147680Z 2025-05-07T20:32:09.8148109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.8148617Z 2025-05-07T20:32:09.8148724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.8149139Z self=, 2025-05-07T20:32:09.8149544Z T=16384, 2025-05-07T20:32:09.8149738Z D=5120, 2025-05-07T20:32:09.8149939Z scale_ub=1200.0, 2025-05-07T20:32:09.8150173Z contiguous=True, 2025-05-07T20:32:09.8150394Z compiled=True, 2025-05-07T20:32:09.8150603Z ) 2025-05-07T20:32:09.8150925Z self = 2025-05-07T20:32:09.8151414Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:09.8151694Z 2025-05-07T20:32:09.8151771Z @given( 2025-05-07T20:32:09.8152011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8152329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8152640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8152981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8153310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8153593Z ) 2025-05-07T20:32:09.8154006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8154449Z def test_silu_mul_quant( 2025-05-07T20:32:09.8154692Z self, 2025-05-07T20:32:09.8154891Z T: int, 2025-05-07T20:32:09.8155094Z D: int, 2025-05-07T20:32:09.8155314Z scale_ub: Optional[float], 2025-05-07T20:32:09.8155592Z contiguous: bool, 2025-05-07T20:32:09.8155843Z compiled: bool, 2025-05-07T20:32:09.8156072Z ) -> None: 2025-05-07T20:32:09.8156291Z torch.manual_seed(2025) 2025-05-07T20:32:09.8156588Z 2025-05-07T20:32:09.8156867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8157205Z 2025-05-07T20:32:09.8157483Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8157783Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8158092Z x = x_sign * x_clamp 2025-05-07T20:32:09.8158341Z x0 = x[:, :D] 2025-05-07T20:32:09.8158566Z x1 = x[:, D:] 2025-05-07T20:32:09.8158772Z 2025-05-07T20:32:09.8158961Z if contiguous: 2025-05-07T20:32:09.8159199Z x0 = x0.contiguous() 2025-05-07T20:32:09.8159459Z x1 = x1.contiguous() 2025-05-07T20:32:09.8159704Z 2025-05-07T20:32:09.8159903Z if scale_ub is not None: 2025-05-07T20:32:09.8160173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8160514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8160826Z ) 2025-05-07T20:32:09.8161022Z else: 2025-05-07T20:32:09.8161238Z scale_ub_tensor = None 2025-05-07T20:32:09.8161495Z 2025-05-07T20:32:09.8161737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8162047Z op = silu_mul_quant 2025-05-07T20:32:09.8162305Z if compiled: 2025-05-07T20:32:09.8162559Z op = torch.compile(op) 2025-05-07T20:32:09.8162854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8163136Z 2025-05-07T20:32:09.8163356Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.8163544Z 2025-05-07T20:32:09.8163647Z moe/activation_test.py:117: 2025-05-07T20:32:09.8163945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8164278Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.8164564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8165121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.8165685Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.8166348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.8167030Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.8167574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.8168256Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.8168919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.8169446Z kernel = self.compile( 2025-05-07T20:32:09.8169990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.8170662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.8171052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8171290Z 2025-05-07T20:32:09.8171503Z self = 2025-05-07T20:32:09.8172574Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.8174005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf317e0>} 2025-05-07T20:32:09.8175346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.8176359Z context = 2025-05-07T20:32:09.8176699Z 2025-05-07T20:32:09.8176872Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.8177472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.8177950Z module_map=module_map) 2025-05-07T20:32:09.8178315Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.8178677Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.8178960Z E ^ 2025-05-07T20:32:09.8179438Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8179937Z 2025-05-07T20:32:09.8180352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.8180869Z 2025-05-07T20:32:10.0085320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.0086138Z self=, 2025-05-07T20:32:10.0086849Z T=16384, 2025-05-07T20:32:10.0087149Z D=5120, 2025-05-07T20:32:10.0087442Z scale_ub=None, 2025-05-07T20:32:10.0087779Z contiguous=False, 2025-05-07T20:32:10.0088124Z compiled=True, 2025-05-07T20:32:10.0088426Z ) 2025-05-07T20:32:10.0088908Z self = 2025-05-07T20:32:10.0089653Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:10.0090447Z 2025-05-07T20:32:10.0090563Z @given( 2025-05-07T20:32:10.0090922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.0091394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.0091848Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.0101825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.0102225Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.0102534Z ) 2025-05-07T20:32:10.0102905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.0103364Z def test_silu_mul_quant( 2025-05-07T20:32:10.0103627Z self, 2025-05-07T20:32:10.0103844Z T: int, 2025-05-07T20:32:10.0104057Z D: int, 2025-05-07T20:32:10.0104286Z scale_ub: Optional[float], 2025-05-07T20:32:10.0104575Z contiguous: bool, 2025-05-07T20:32:10.0104833Z compiled: bool, 2025-05-07T20:32:10.0105073Z ) -> None: 2025-05-07T20:32:10.0105308Z torch.manual_seed(2025) 2025-05-07T20:32:10.0105567Z 2025-05-07T20:32:10.0105853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.0106212Z 2025-05-07T20:32:10.0106421Z x_sign = torch.sign(x) 2025-05-07T20:32:10.0106727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.0107054Z x = x_sign * x_clamp 2025-05-07T20:32:10.0107319Z x0 = x[:, :D] 2025-05-07T20:32:10.0107545Z x1 = x[:, D:] 2025-05-07T20:32:10.0107774Z 2025-05-07T20:32:10.0107975Z if contiguous: 2025-05-07T20:32:10.0108218Z x0 = x0.contiguous() 2025-05-07T20:32:10.0108500Z x1 = x1.contiguous() 2025-05-07T20:32:10.0108756Z 2025-05-07T20:32:10.0108956Z if scale_ub is not None: 2025-05-07T20:32:10.0109249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.0109600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.0110184Z ) 2025-05-07T20:32:10.0110381Z else: 2025-05-07T20:32:10.0110599Z scale_ub_tensor = None 2025-05-07T20:32:10.0110861Z 2025-05-07T20:32:10.0111102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.0111433Z op = silu_mul_quant 2025-05-07T20:32:10.0111705Z if compiled: 2025-05-07T20:32:10.0111964Z op = torch.compile(op) 2025-05-07T20:32:10.0112277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.0112567Z 2025-05-07T20:32:10.0112869Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.0113054Z 2025-05-07T20:32:10.0113163Z moe/activation_test.py:117: 2025-05-07T20:32:10.0113615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.0113963Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.0114261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.0114836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.0115415Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.0116078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.0116783Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.0117337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.0118033Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.0118715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.0119263Z kernel = self.compile( 2025-05-07T20:32:10.0119820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.0120488Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.0120909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.0121155Z 2025-05-07T20:32:10.0121371Z self = 2025-05-07T20:32:10.0122467Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.0123876Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf32680>} 2025-05-07T20:32:10.0125229Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.0126266Z context = 2025-05-07T20:32:10.0126558Z 2025-05-07T20:32:10.0126742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.0127279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.0127756Z module_map=module_map) 2025-05-07T20:32:10.0128139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.0128512Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.0128782Z E ^ 2025-05-07T20:32:10.0129263Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.0129717Z 2025-05-07T20:32:10.0130149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.0130664Z 2025-05-07T20:32:10.0130785Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.0131262Z self=, 2025-05-07T20:32:10.0131671Z T=2048, 2025-05-07T20:32:10.0131865Z D=5120, 2025-05-07T20:32:10.0132068Z scale_ub=None, 2025-05-07T20:32:10.0132292Z contiguous=False, 2025-05-07T20:32:10.0132519Z compiled=True, 2025-05-07T20:32:10.0132738Z ) 2025-05-07T20:32:10.1168197Z self = 2025-05-07T20:32:10.1168816Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:10.1169324Z 2025-05-07T20:32:10.1169411Z @given( 2025-05-07T20:32:10.1169652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.1170125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.1170448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.1170781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.1171119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.1171423Z ) 2025-05-07T20:32:10.1171784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.1172228Z def test_silu_mul_quant( 2025-05-07T20:32:10.1172485Z self, 2025-05-07T20:32:10.1172694Z T: int, 2025-05-07T20:32:10.1172903Z D: int, 2025-05-07T20:32:10.1173138Z scale_ub: Optional[float], 2025-05-07T20:32:10.1173421Z contiguous: bool, 2025-05-07T20:32:10.1173667Z compiled: bool, 2025-05-07T20:32:10.1173910Z ) -> None: 2025-05-07T20:32:10.1174148Z torch.manual_seed(2025) 2025-05-07T20:32:10.1174394Z 2025-05-07T20:32:10.1174685Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.1175035Z 2025-05-07T20:32:10.1175234Z x_sign = torch.sign(x) 2025-05-07T20:32:10.1175538Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.1175862Z x = x_sign * x_clamp 2025-05-07T20:32:10.1176115Z x0 = x[:, :D] 2025-05-07T20:32:10.1176346Z x1 = x[:, D:] 2025-05-07T20:32:10.1176567Z 2025-05-07T20:32:10.1176759Z if contiguous: 2025-05-07T20:32:10.1177003Z x0 = x0.contiguous() 2025-05-07T20:32:10.1177282Z x1 = x1.contiguous() 2025-05-07T20:32:10.1177533Z 2025-05-07T20:32:10.1177732Z if scale_ub is not None: 2025-05-07T20:32:10.1178014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.1178359Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.1178677Z ) 2025-05-07T20:32:10.1178882Z else: 2025-05-07T20:32:10.1179104Z scale_ub_tensor = None 2025-05-07T20:32:10.1179361Z 2025-05-07T20:32:10.1179608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.1180020Z op = silu_mul_quant 2025-05-07T20:32:10.1180276Z if compiled: 2025-05-07T20:32:10.1180535Z op = torch.compile(op) 2025-05-07T20:32:10.1180846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.1181125Z 2025-05-07T20:32:10.1181332Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.1181503Z 2025-05-07T20:32:10.1181614Z moe/activation_test.py:117: 2025-05-07T20:32:10.1181923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1182259Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.1182552Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.1183123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.1183689Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.1184360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.1185060Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.1185611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.1186382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.1187059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.1187598Z kernel = self.compile( 2025-05-07T20:32:10.1188142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.1188804Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.1189262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1189493Z 2025-05-07T20:32:10.1189795Z self = 2025-05-07T20:32:10.1191132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.1192521Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf32560>} 2025-05-07T20:32:10.1193872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.1194905Z context = 2025-05-07T20:32:10.1195199Z 2025-05-07T20:32:10.1195386Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.1195909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.1196384Z module_map=module_map) 2025-05-07T20:32:10.1196768Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.1197131Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.1197407Z E ^ 2025-05-07T20:32:10.1197881Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.1198330Z 2025-05-07T20:32:10.1198764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.1199279Z 2025-05-07T20:32:10.1199389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.1199812Z self=, 2025-05-07T20:32:10.1200222Z T=2048, 2025-05-07T20:32:10.1200412Z D=5120, 2025-05-07T20:32:10.1200620Z scale_ub=1200.0, 2025-05-07T20:32:10.1200857Z contiguous=False, 2025-05-07T20:32:10.1201085Z compiled=True, 2025-05-07T20:32:10.1201302Z ) 2025-05-07T20:32:10.1201633Z self = 2025-05-07T20:32:10.1202144Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:10.1202417Z 2025-05-07T20:32:10.1202497Z @given( 2025-05-07T20:32:10.1202738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.1203065Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.1203373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.1203711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.1204046Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.1204337Z ) 2025-05-07T20:32:10.1204694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.1205147Z def test_silu_mul_quant( 2025-05-07T20:32:10.1205397Z self, 2025-05-07T20:32:10.1205595Z T: int, 2025-05-07T20:32:10.1205803Z D: int, 2025-05-07T20:32:10.1206032Z scale_ub: Optional[float], 2025-05-07T20:32:10.1206311Z contiguous: bool, 2025-05-07T20:32:10.1206662Z compiled: bool, 2025-05-07T20:32:10.1206894Z ) -> None: 2025-05-07T20:32:10.1207113Z torch.manual_seed(2025) 2025-05-07T20:32:10.1207365Z 2025-05-07T20:32:10.1207648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.1207988Z 2025-05-07T20:32:10.1208189Z x_sign = torch.sign(x) 2025-05-07T20:32:10.1208494Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.1208812Z x = x_sign * x_clamp 2025-05-07T20:32:10.1209064Z x0 = x[:, :D] 2025-05-07T20:32:10.1209364Z x1 = x[:, D:] 2025-05-07T20:32:10.1209577Z 2025-05-07T20:32:10.1209775Z if contiguous: 2025-05-07T20:32:10.1210200Z x0 = x0.contiguous() 2025-05-07T20:32:10.1210470Z x1 = x1.contiguous() 2025-05-07T20:32:10.1210725Z 2025-05-07T20:32:10.1210930Z if scale_ub is not None: 2025-05-07T20:32:10.1211205Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.1211560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.1211875Z ) 2025-05-07T20:32:10.1212083Z else: 2025-05-07T20:32:10.1212304Z scale_ub_tensor = None 2025-05-07T20:32:10.1212569Z 2025-05-07T20:32:10.1212817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.1213133Z op = silu_mul_quant 2025-05-07T20:32:10.1213396Z if compiled: 2025-05-07T20:32:10.1213655Z op = torch.compile(op) 2025-05-07T20:32:10.1213962Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.1214251Z 2025-05-07T20:32:10.1214454Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.1214624Z 2025-05-07T20:32:10.1214735Z moe/activation_test.py:117: 2025-05-07T20:32:10.1215040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1215378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.1215667Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.1216230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.1216797Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.1217464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.1218153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.1218695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.1219392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.1220146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.1220679Z kernel = self.compile( 2025-05-07T20:32:10.1221226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.1221885Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.1222281Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1222520Z 2025-05-07T20:32:10.1222730Z self = 2025-05-07T20:32:10.1223864Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.1225241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aaf33370>} 2025-05-07T20:32:10.1226590Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.1227676Z context = 2025-05-07T20:32:10.1227974Z 2025-05-07T20:32:10.1228144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.1228682Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.1229160Z module_map=module_map) 2025-05-07T20:32:10.1229533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.1229943Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.1230213Z E ^ 2025-05-07T20:32:10.1230764Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.1231222Z 2025-05-07T20:32:10.1231640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.1232161Z 2025-05-07T20:32:10.3143917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.3144501Z self=, 2025-05-07T20:32:10.3144913Z T=4096, 2025-05-07T20:32:10.3145104Z D=5120, 2025-05-07T20:32:10.3145304Z scale_ub=1200.0, 2025-05-07T20:32:10.3145533Z contiguous=True, 2025-05-07T20:32:10.3145754Z compiled=True, 2025-05-07T20:32:10.3145975Z ) 2025-05-07T20:32:10.3146298Z self = 2025-05-07T20:32:10.3146788Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:10.3147085Z 2025-05-07T20:32:10.3147167Z @given( 2025-05-07T20:32:10.3147418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.3147731Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.3148041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.3148374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.3148711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.3148991Z ) 2025-05-07T20:32:10.3149344Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.3149787Z def test_silu_mul_quant( 2025-05-07T20:32:10.3150028Z self, 2025-05-07T20:32:10.3150230Z T: int, 2025-05-07T20:32:10.3150432Z D: int, 2025-05-07T20:32:10.3150649Z scale_ub: Optional[float], 2025-05-07T20:32:10.3150924Z contiguous: bool, 2025-05-07T20:32:10.3151167Z compiled: bool, 2025-05-07T20:32:10.3151395Z ) -> None: 2025-05-07T20:32:10.3151614Z torch.manual_seed(2025) 2025-05-07T20:32:10.3151860Z 2025-05-07T20:32:10.3152134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.3152480Z 2025-05-07T20:32:10.3152680Z x_sign = torch.sign(x) 2025-05-07T20:32:10.3152967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.3153282Z x = x_sign * x_clamp 2025-05-07T20:32:10.3153548Z x0 = x[:, :D] 2025-05-07T20:32:10.3153793Z x1 = x[:, D:] 2025-05-07T20:32:10.3153996Z 2025-05-07T20:32:10.3154187Z if contiguous: 2025-05-07T20:32:10.3154425Z x0 = x0.contiguous() 2025-05-07T20:32:10.3154682Z x1 = x1.contiguous() 2025-05-07T20:32:10.3154924Z 2025-05-07T20:32:10.3155121Z if scale_ub is not None: 2025-05-07T20:32:10.3155386Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.3155722Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.3156034Z ) 2025-05-07T20:32:10.3156228Z else: 2025-05-07T20:32:10.3156452Z scale_ub_tensor = None 2025-05-07T20:32:10.3156709Z 2025-05-07T20:32:10.3156936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.3157251Z op = silu_mul_quant 2025-05-07T20:32:10.3157504Z if compiled: 2025-05-07T20:32:10.3158044Z op = torch.compile(op) 2025-05-07T20:32:10.3158344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.3158618Z 2025-05-07T20:32:10.3158815Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.3158980Z 2025-05-07T20:32:10.3159082Z moe/activation_test.py:117: 2025-05-07T20:32:10.3159381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3159720Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.3160002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.3160661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.3161356Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.3162012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.3162701Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.3163241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.3163921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.3164573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.3165107Z kernel = self.compile( 2025-05-07T20:32:10.3165649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.3166307Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.3166706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3166938Z 2025-05-07T20:32:10.3167147Z self = 2025-05-07T20:32:10.3168224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.3169606Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01c310>} 2025-05-07T20:32:10.3170935Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.3171968Z context = 2025-05-07T20:32:10.3172259Z 2025-05-07T20:32:10.3172432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.3172953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.3173410Z module_map=module_map) 2025-05-07T20:32:10.3173780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.3174134Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.3174384Z E ^ 2025-05-07T20:32:10.3174846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.3175297Z 2025-05-07T20:32:10.3175711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.3176219Z 2025-05-07T20:32:10.3176331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.3176735Z self=, 2025-05-07T20:32:10.3177137Z T=128, 2025-05-07T20:32:10.3177328Z D=5120, 2025-05-07T20:32:10.3177525Z scale_ub=1200.0, 2025-05-07T20:32:10.3177743Z contiguous=False, 2025-05-07T20:32:10.3177969Z compiled=True, 2025-05-07T20:32:10.3178171Z ) 2025-05-07T20:32:10.6252448Z self = 2025-05-07T20:32:10.6253212Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:10.6253665Z 2025-05-07T20:32:10.6253840Z @given( 2025-05-07T20:32:10.6254241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.6254712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.6255167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.6255655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.6256510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.6256938Z ) 2025-05-07T20:32:10.6257651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.6258328Z def test_silu_mul_quant( 2025-05-07T20:32:10.6258690Z self, 2025-05-07T20:32:10.6258982Z T: int, 2025-05-07T20:32:10.6259270Z D: int, 2025-05-07T20:32:10.6259599Z scale_ub: Optional[float], 2025-05-07T20:32:10.6260134Z contiguous: bool, 2025-05-07T20:32:10.6260493Z compiled: bool, 2025-05-07T20:32:10.6260841Z ) -> None: 2025-05-07T20:32:10.6261165Z torch.manual_seed(2025) 2025-05-07T20:32:10.6261521Z 2025-05-07T20:32:10.6261934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.6262466Z 2025-05-07T20:32:10.6262757Z x_sign = torch.sign(x) 2025-05-07T20:32:10.6263198Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.6263580Z x = x_sign * x_clamp 2025-05-07T20:32:10.6263830Z x0 = x[:, :D] 2025-05-07T20:32:10.6264058Z x1 = x[:, D:] 2025-05-07T20:32:10.6264265Z 2025-05-07T20:32:10.6264470Z if contiguous: 2025-05-07T20:32:10.6264712Z x0 = x0.contiguous() 2025-05-07T20:32:10.6264970Z x1 = x1.contiguous() 2025-05-07T20:32:10.6265211Z 2025-05-07T20:32:10.6265413Z if scale_ub is not None: 2025-05-07T20:32:10.6265693Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.6266033Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.6266343Z ) 2025-05-07T20:32:10.6266536Z else: 2025-05-07T20:32:10.6266756Z scale_ub_tensor = None 2025-05-07T20:32:10.6267014Z 2025-05-07T20:32:10.6267246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6277821Z op = silu_mul_quant 2025-05-07T20:32:10.6278110Z if compiled: 2025-05-07T20:32:10.6278378Z op = torch.compile(op) 2025-05-07T20:32:10.6278702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6278984Z 2025-05-07T20:32:10.6279202Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.6279372Z 2025-05-07T20:32:10.6279486Z moe/activation_test.py:117: 2025-05-07T20:32:10.6279793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6280142Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.6280442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6281010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.6281586Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.6282261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.6282965Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.6283506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.6284205Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.6284878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.6285423Z kernel = self.compile( 2025-05-07T20:32:10.6285972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.6286762Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.6287172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6287405Z 2025-05-07T20:32:10.6287617Z self = 2025-05-07T20:32:10.6288708Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.6290619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01d090>} 2025-05-07T20:32:10.6291970Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.6292999Z context = 2025-05-07T20:32:10.6293287Z 2025-05-07T20:32:10.6293456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.6293992Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.6294467Z module_map=module_map) 2025-05-07T20:32:10.6294843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.6295205Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.6295473Z E ^ 2025-05-07T20:32:10.6295954Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.6296404Z 2025-05-07T20:32:10.6296824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.6297346Z 2025-05-07T20:32:10.6297456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.6297877Z self=, 2025-05-07T20:32:10.6298282Z T=16384, 2025-05-07T20:32:10.6298474Z D=7168, 2025-05-07T20:32:10.6298678Z scale_ub=1200.0, 2025-05-07T20:32:10.6298910Z contiguous=True, 2025-05-07T20:32:10.6299133Z compiled=True, 2025-05-07T20:32:10.6299345Z ) 2025-05-07T20:32:10.6299667Z self = 2025-05-07T20:32:10.6300263Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:10.6300552Z 2025-05-07T20:32:10.6300630Z @given( 2025-05-07T20:32:10.6300870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.6301186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.6301501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.6301849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.6302189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.6302478Z ) 2025-05-07T20:32:10.6302839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.6303293Z def test_silu_mul_quant( 2025-05-07T20:32:10.6303548Z self, 2025-05-07T20:32:10.6303792Z T: int, 2025-05-07T20:32:10.6304012Z D: int, 2025-05-07T20:32:10.6304237Z scale_ub: Optional[float], 2025-05-07T20:32:10.6304524Z contiguous: bool, 2025-05-07T20:32:10.6304776Z compiled: bool, 2025-05-07T20:32:10.6305003Z ) -> None: 2025-05-07T20:32:10.6305239Z torch.manual_seed(2025) 2025-05-07T20:32:10.6305493Z 2025-05-07T20:32:10.6305769Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.6306118Z 2025-05-07T20:32:10.6306325Z x_sign = torch.sign(x) 2025-05-07T20:32:10.6306698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.6307020Z x = x_sign * x_clamp 2025-05-07T20:32:10.6307274Z x0 = x[:, :D] 2025-05-07T20:32:10.6307500Z x1 = x[:, D:] 2025-05-07T20:32:10.6307711Z 2025-05-07T20:32:10.6307894Z if contiguous: 2025-05-07T20:32:10.6308137Z x0 = x0.contiguous() 2025-05-07T20:32:10.6308401Z x1 = x1.contiguous() 2025-05-07T20:32:10.6308636Z 2025-05-07T20:32:10.6308835Z if scale_ub is not None: 2025-05-07T20:32:10.6309116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.6309531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.6309956Z ) 2025-05-07T20:32:10.6310162Z else: 2025-05-07T20:32:10.6310382Z scale_ub_tensor = None 2025-05-07T20:32:10.6310634Z 2025-05-07T20:32:10.6310874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6311197Z op = silu_mul_quant 2025-05-07T20:32:10.6311455Z if compiled: 2025-05-07T20:32:10.6311713Z op = torch.compile(op) 2025-05-07T20:32:10.6312015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6312289Z 2025-05-07T20:32:10.6312491Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.6312658Z 2025-05-07T20:32:10.6312765Z moe/activation_test.py:117: 2025-05-07T20:32:10.6313065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6313406Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.6313702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6314268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.6314825Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.6315485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.6316174Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.6316709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.6317398Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.6318063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.6318597Z kernel = self.compile( 2025-05-07T20:32:10.6319135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.6319800Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.6320204Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6320433Z 2025-05-07T20:32:10.6320647Z self = 2025-05-07T20:32:10.6321713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.6323079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01e290>} 2025-05-07T20:32:10.6324418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.6325447Z context = 2025-05-07T20:32:10.6325735Z 2025-05-07T20:32:10.6325910Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.6326426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.6326942Z module_map=module_map) 2025-05-07T20:32:10.6327312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.6327662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.6327925Z E ^ 2025-05-07T20:32:10.6328396Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.6328841Z 2025-05-07T20:32:10.6329265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.6329829Z 2025-05-07T20:32:10.7673381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7674450Z self=, 2025-05-07T20:32:10.7674956Z T=16384, 2025-05-07T20:32:10.7675162Z D=5120, 2025-05-07T20:32:10.7675365Z scale_ub=1200.0, 2025-05-07T20:32:10.7675595Z contiguous=True, 2025-05-07T20:32:10.7675823Z compiled=False, 2025-05-07T20:32:10.7676056Z ) 2025-05-07T20:32:10.7676379Z self = 2025-05-07T20:32:10.7676885Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:10.7677167Z 2025-05-07T20:32:10.7677256Z @given( 2025-05-07T20:32:10.7677485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7677908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7678345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7678723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7679043Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7679334Z ) 2025-05-07T20:32:10.7679686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7680121Z def test_silu_mul_quant( 2025-05-07T20:32:10.7680368Z self, 2025-05-07T20:32:10.7680563Z T: int, 2025-05-07T20:32:10.7680759Z D: int, 2025-05-07T20:32:10.7680977Z scale_ub: Optional[float], 2025-05-07T20:32:10.7681246Z contiguous: bool, 2025-05-07T20:32:10.7681481Z compiled: bool, 2025-05-07T20:32:10.7681705Z ) -> None: 2025-05-07T20:32:10.7681923Z torch.manual_seed(2025) 2025-05-07T20:32:10.7682160Z 2025-05-07T20:32:10.7682432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7682775Z 2025-05-07T20:32:10.7682968Z x_sign = torch.sign(x) 2025-05-07T20:32:10.7683255Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.7683570Z x = x_sign * x_clamp 2025-05-07T20:32:10.7683814Z x0 = x[:, :D] 2025-05-07T20:32:10.7684032Z x1 = x[:, D:] 2025-05-07T20:32:10.7684243Z 2025-05-07T20:32:10.7684429Z if contiguous: 2025-05-07T20:32:10.7684656Z x0 = x0.contiguous() 2025-05-07T20:32:10.7684914Z x1 = x1.contiguous() 2025-05-07T20:32:10.7685153Z 2025-05-07T20:32:10.7685339Z if scale_ub is not None: 2025-05-07T20:32:10.7685612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.7685947Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.7686249Z ) 2025-05-07T20:32:10.7686443Z else: 2025-05-07T20:32:10.7686659Z scale_ub_tensor = None 2025-05-07T20:32:10.7686905Z 2025-05-07T20:32:10.7687138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.7687455Z op = silu_mul_quant 2025-05-07T20:32:10.7687712Z if compiled: 2025-05-07T20:32:10.7687958Z op = torch.compile(op) 2025-05-07T20:32:10.7688264Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7688540Z 2025-05-07T20:32:10.7688728Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.7688901Z 2025-05-07T20:32:10.7689001Z moe/activation_test.py:117: 2025-05-07T20:32:10.7689296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7689734Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.7690317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7691009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.7691694Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.7692222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.7692903Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.7693762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.7694296Z kernel = self.compile( 2025-05-07T20:32:10.7694837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.7695494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.7695893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7696118Z 2025-05-07T20:32:10.7696325Z self = 2025-05-07T20:32:10.7697397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.7698778Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01d1b0>} 2025-05-07T20:32:10.7700223Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.7701248Z context = 2025-05-07T20:32:10.7701536Z 2025-05-07T20:32:10.7701701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.7702221Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.7702690Z module_map=module_map) 2025-05-07T20:32:10.7703055Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.7703407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.7703674Z E ^ 2025-05-07T20:32:10.7704147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.7704593Z 2025-05-07T20:32:10.7705009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7705524Z 2025-05-07T20:32:10.7705628Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7706049Z self=, 2025-05-07T20:32:10.7706447Z T=1, 2025-05-07T20:32:10.7706628Z D=7168, 2025-05-07T20:32:10.7706826Z scale_ub=1200.0, 2025-05-07T20:32:10.7707054Z contiguous=False, 2025-05-07T20:32:10.7707277Z compiled=False, 2025-05-07T20:32:10.7707488Z ) 2025-05-07T20:32:10.7707806Z self = 2025-05-07T20:32:10.7708291Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.7708564Z 2025-05-07T20:32:10.7708640Z @given( 2025-05-07T20:32:10.7708933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7709366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7709726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7710322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7710731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7711163Z ) 2025-05-07T20:32:10.7711725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7712277Z def test_silu_mul_quant( 2025-05-07T20:32:10.7712559Z self, 2025-05-07T20:32:10.7712927Z T: int, 2025-05-07T20:32:10.7713211Z D: int, 2025-05-07T20:32:10.7713472Z scale_ub: Optional[float], 2025-05-07T20:32:10.7713916Z contiguous: bool, 2025-05-07T20:32:10.7714242Z compiled: bool, 2025-05-07T20:32:10.7714508Z ) -> None: 2025-05-07T20:32:10.7714945Z torch.manual_seed(2025) 2025-05-07T20:32:10.7715274Z 2025-05-07T20:32:10.7715705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7716216Z 2025-05-07T20:32:10.7716498Z x_sign = torch.sign(x) 2025-05-07T20:32:10.7716940Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.7717322Z x = x_sign * x_clamp 2025-05-07T20:32:10.7717653Z x0 = x[:, :D] 2025-05-07T20:32:10.7718017Z x1 = x[:, D:] 2025-05-07T20:32:10.7718292Z 2025-05-07T20:32:10.7718565Z if contiguous: 2025-05-07T20:32:10.7718947Z x0 = x0.contiguous() 2025-05-07T20:32:10.7719275Z x1 = x1.contiguous() 2025-05-07T20:32:10.7719667Z 2025-05-07T20:32:10.7719985Z if scale_ub is not None: 2025-05-07T20:32:10.7720332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.7720775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.7721206Z ) 2025-05-07T20:32:10.7721499Z else: 2025-05-07T20:32:10.7721788Z scale_ub_tensor = None 2025-05-07T20:32:10.7722167Z 2025-05-07T20:32:10.7722505Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.7722890Z op = silu_mul_quant 2025-05-07T20:32:10.7723260Z if compiled: 2025-05-07T20:32:10.7723635Z op = torch.compile(op) 2025-05-07T20:32:10.7724034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7724427Z 2025-05-07T20:32:10.7724745Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.7724938Z 2025-05-07T20:32:10.7725099Z moe/activation_test.py:117: 2025-05-07T20:32:10.7725479Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7725935Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.7726303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7727075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.7727888Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.7728514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.7729361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.7730083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.7730744Z kernel = self.compile( 2025-05-07T20:32:10.7731445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.7732180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.7732630Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7732968Z 2025-05-07T20:32:10.7733244Z self = 2025-05-07T20:32:10.7734424Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.7735872Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01e680>} 2025-05-07T20:32:10.7737433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.7738516Z context = 2025-05-07T20:32:10.7738863Z 2025-05-07T20:32:10.7739068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.7739874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.7740578Z module_map=module_map) 2025-05-07T20:32:10.7740988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.7741500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.7741876Z E ^ 2025-05-07T20:32:10.7742377Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.7742983Z 2025-05-07T20:32:10.7743431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7744021Z 2025-05-07T20:32:10.9655784Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.9656638Z self=, 2025-05-07T20:32:10.9657294Z T=4096, 2025-05-07T20:32:10.9657697Z D=7168, 2025-05-07T20:32:10.9658009Z scale_ub=1200.0, 2025-05-07T20:32:10.9658305Z contiguous=False, 2025-05-07T20:32:10.9658675Z compiled=True, 2025-05-07T20:32:10.9658975Z ) 2025-05-07T20:32:10.9659384Z self = 2025-05-07T20:32:10.9660109Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:10.9660446Z 2025-05-07T20:32:10.9660550Z @given( 2025-05-07T20:32:10.9660882Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.9661352Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.9661755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.9662177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.9662654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.9662993Z ) 2025-05-07T20:32:10.9663441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.9664024Z def test_silu_mul_quant( 2025-05-07T20:32:10.9664377Z self, 2025-05-07T20:32:10.9664613Z T: int, 2025-05-07T20:32:10.9664953Z D: int, 2025-05-07T20:32:10.9665286Z scale_ub: Optional[float], 2025-05-07T20:32:10.9665597Z contiguous: bool, 2025-05-07T20:32:10.9665985Z compiled: bool, 2025-05-07T20:32:10.9666364Z ) -> None: 2025-05-07T20:32:10.9666622Z torch.manual_seed(2025) 2025-05-07T20:32:10.9667013Z 2025-05-07T20:32:10.9667395Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.9667854Z 2025-05-07T20:32:10.9668116Z x_sign = torch.sign(x) 2025-05-07T20:32:10.9668518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.9668962Z x = x_sign * x_clamp 2025-05-07T20:32:10.9669273Z x0 = x[:, :D] 2025-05-07T20:32:10.9669598Z x1 = x[:, D:] 2025-05-07T20:32:10.9669925Z 2025-05-07T20:32:10.9670201Z if contiguous: 2025-05-07T20:32:10.9670565Z x0 = x0.contiguous() 2025-05-07T20:32:10.9670947Z x1 = x1.contiguous() 2025-05-07T20:32:10.9671274Z 2025-05-07T20:32:10.9671562Z if scale_ub is not None: 2025-05-07T20:32:10.9671958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.9672420Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.9672780Z ) 2025-05-07T20:32:10.9673088Z else: 2025-05-07T20:32:10.9673694Z scale_ub_tensor = None 2025-05-07T20:32:10.9673998Z 2025-05-07T20:32:10.9674375Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.9674838Z op = silu_mul_quant 2025-05-07T20:32:10.9675144Z if compiled: 2025-05-07T20:32:10.9675532Z op = torch.compile(op) 2025-05-07T20:32:10.9675930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.9676254Z 2025-05-07T20:32:10.9676585Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.9676828Z 2025-05-07T20:32:10.9676956Z moe/activation_test.py:117: 2025-05-07T20:32:10.9677458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.9678011Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.9678393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.9679065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.9679752Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.9680511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.9681302Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.9681970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.9682720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.9683462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.9684132Z kernel = self.compile( 2025-05-07T20:32:10.9684790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.9685501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.9686031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.9686309Z 2025-05-07T20:32:10.9686582Z self = 2025-05-07T20:32:10.9687741Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.9689298Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab01fb50>} 2025-05-07T20:32:10.9691059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.9692147Z context = 2025-05-07T20:32:10.9692520Z 2025-05-07T20:32:10.9692787Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.9693396Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.9693897Z module_map=module_map) 2025-05-07T20:32:10.9694437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.9694871Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.9695170Z E ^ 2025-05-07T20:32:10.9695836Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.9696340Z 2025-05-07T20:32:10.9696797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.9697343Z 2025-05-07T20:32:10.9697610Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.9698096Z self=, 2025-05-07T20:32:10.9698657Z T=128, 2025-05-07T20:32:10.9698998Z D=7168, 2025-05-07T20:32:10.9699262Z scale_ub=1200.0, 2025-05-07T20:32:10.9699573Z contiguous=False, 2025-05-07T20:32:10.9700039Z compiled=True, 2025-05-07T20:32:10.9700342Z ) 2025-05-07T20:32:11.0728026Z self = 2025-05-07T20:32:11.0728762Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.0729152Z 2025-05-07T20:32:11.0729264Z @given( 2025-05-07T20:32:11.0729592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.0730170Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.0730620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.0730969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.0731310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.0731602Z ) 2025-05-07T20:32:11.0731964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.0732422Z def test_silu_mul_quant( 2025-05-07T20:32:11.0732670Z self, 2025-05-07T20:32:11.0732874Z T: int, 2025-05-07T20:32:11.0733081Z D: int, 2025-05-07T20:32:11.0733303Z scale_ub: Optional[float], 2025-05-07T20:32:11.0733587Z contiguous: bool, 2025-05-07T20:32:11.0733834Z compiled: bool, 2025-05-07T20:32:11.0734064Z ) -> None: 2025-05-07T20:32:11.0734290Z torch.manual_seed(2025) 2025-05-07T20:32:11.0734546Z 2025-05-07T20:32:11.0734837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.0735184Z 2025-05-07T20:32:11.0735397Z x_sign = torch.sign(x) 2025-05-07T20:32:11.0735699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.0736009Z x = x_sign * x_clamp 2025-05-07T20:32:11.0736262Z x0 = x[:, :D] 2025-05-07T20:32:11.0736485Z x1 = x[:, D:] 2025-05-07T20:32:11.0736697Z 2025-05-07T20:32:11.0736892Z if contiguous: 2025-05-07T20:32:11.0737133Z x0 = x0.contiguous() 2025-05-07T20:32:11.0737392Z x1 = x1.contiguous() 2025-05-07T20:32:11.0737641Z 2025-05-07T20:32:11.0737844Z if scale_ub is not None: 2025-05-07T20:32:11.0738119Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.0738459Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.0738775Z ) 2025-05-07T20:32:11.0738970Z else: 2025-05-07T20:32:11.0739190Z scale_ub_tensor = None 2025-05-07T20:32:11.0739457Z 2025-05-07T20:32:11.0739689Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.0740087Z op = silu_mul_quant 2025-05-07T20:32:11.0740353Z if compiled: 2025-05-07T20:32:11.0740638Z op = torch.compile(op) 2025-05-07T20:32:11.0740946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.0741221Z 2025-05-07T20:32:11.0741429Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.0741597Z 2025-05-07T20:32:11.0741712Z moe/activation_test.py:117: 2025-05-07T20:32:11.0742010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.0742353Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.0742649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.0743209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.0743774Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.0744455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.0745151Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.0745685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.0746371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.0747118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.0747664Z kernel = self.compile( 2025-05-07T20:32:11.0748208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.0748879Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.0749281Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.0749556Z 2025-05-07T20:32:11.0749766Z self = 2025-05-07T20:32:11.0750919Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.0752321Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aab4c670>} 2025-05-07T20:32:11.0753648Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.0754660Z context = 2025-05-07T20:32:11.0754950Z 2025-05-07T20:32:11.0755117Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.0755645Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.0756109Z module_map=module_map) 2025-05-07T20:32:11.0756477Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.0756828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.0757085Z E ^ 2025-05-07T20:32:11.0757549Z E ValueError("type fp8e4nv not supported in this architecture. 
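For orientation: judging from the test body above and the kernel name in the traceback, silu_mul_quant computes silu(x0) * x1 and quantizes the result to FP8, returning the quantized tensor plus a scale. A rough eager-mode sketch of that contract follows; the rowwise scaling and the scale_ub clamp are assumptions read off the call signature, not FBGEMM's documented behavior:

    import torch

    def silu_mul_quant_reference(x0, x1, scale_ub=None):
        # Assumed semantics: y = silu(x0) * x1, quantized rowwise to FP8 E4M3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap per-row scales
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        scale = row_max / fp8_max
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)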
2025-05-07T20:32:11.0759040Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:11.0773371Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.0773645Z moe/activation_test.py:117:
2025-05-07T20:32:11.0788088Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.0788438Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.0788696Z E       ^
2025-05-07T20:32:11.0789157Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.0790475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
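Both CompilationErrors so far, and all of the ones below, share one root cause: Triton cannot lower the fp8e4nv (FP8 E4M3) dtype on this runner's GPU. The g5 runner's NVIDIA A10G reports compute capability (8, 6), and in this Triton build fp8e4nv generally requires compute capability 8.9 or newer (Ada/Hopper); on Ampere only fp8e4b15 and fp8e5 are exposed, exactly the pair the ValueError lists. A guard along these lines would skip rather than fail on such hardware; the helper name and the skipIf placement are illustrative, not part of the test file:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv kernels want compute capability >= 8.9 (e.g. L4, H100);
        # the A10G on this runner reports (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test method:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...): ...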
2025-05-07T20:32:11.1627967Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:11.1637378Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:11.1639386Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:11.1641383Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:11.1641707Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:11.1650818Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:11.1652805Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.1654825Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:11.1655146Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:11.1663190Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.1665344Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.1667383Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.1667700Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:11.1676433Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:11.1678404Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.1680387Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:11.1680711Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.2957889Z >       x_sign = torch.sign(x)
2025-05-07T20:32:11.2959813Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.2961785Z moe/activation_test.py:94: OutOfMemoryError
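The "Tried to allocate" sizes match the test's input tensor exactly: x is [T, 2 * D] in bfloat16 (2 bytes per element), and torch.sign and torch.clamp(torch.abs(...)) each materialize another tensor of the same size. A quick sanity check of the shapes seen in this run:

    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): bytes = T * 2D * 2
    for T, D in [(16384, 7168), (16384, 5120), (4096, 7168),
                 (4096, 5120), (2048, 7168), (2048, 5120)]:
        print(f"T={T:<6} D={D}: {T * 2 * D * 2 / 2**20:.2f} MiB")
    # -> 448.00, 320.00, 112.00, 80.00, 56.00, 40.00 MiB,
    #    matching every "Tried to allocate" size in this log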
2025-05-07T20:32:11.2962101Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:11.2985046Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.2985325Z moe/activation_test.py:117:
2025-05-07T20:32:11.2999479Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.2999837Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.3000110Z E       ^
2025-05-07T20:32:11.3000590Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.3001475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.3002105Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.3788129Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.3788394Z moe/activation_test.py:117:
2025-05-07T20:32:11.3802474Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.3802827Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.3803082Z E       ^
2025-05-07T20:32:11.3803547Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.3804427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.3805042Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.3819484Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.3819755Z moe/activation_test.py:117:
2025-05-07T20:32:11.3833407Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.3833768Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.3834033Z E       ^
2025-05-07T20:32:11.3834493Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.3835363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.3835988Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:11.4806043Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.4808092Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.4810106Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.4810430Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:11.4825160Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.4825427Z moe/activation_test.py:117:
2025-05-07T20:32:11.4838931Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.4839275Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.4839534Z E       ^
2025-05-07T20:32:11.4840001Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.4840864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
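By this point the process is pinned near the A10G's 22.07 GiB with only 26-28 MiB free, so even 40 MiB requests fail: allocations from earlier hypothesis examples are still being held across iterations, and the allocator message itself suggests fragmentation may contribute. Two standard PyTorch mitigations worth trying (neither is FBGEMM-specific, and neither is verified against this run) are the expandable-segments allocator mode the message recommends and an explicit cache flush between examples:

    import os

    # Must be set before the first CUDA allocation in the process;
    # in CI this is usually exported in the workflow environment instead.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_blocks() -> None:
        # Candidate tearDown() hook between hypothesis examples: returns
        # cached, currently unused blocks to the driver to curb fragmentation.
        torch.cuda.empty_cache()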
2025-05-07T20:32:11.4841491Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.4849926Z >       x_sign = torch.sign(x)
2025-05-07T20:32:11.4851924Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.4853981Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:11.4854397Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.5826426Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5828507Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5830509Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5830833Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:11.5839447Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5841533Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5843518Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5843844Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:11.5852159Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5854208Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5856267Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5856594Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:11.5865135Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5867180Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5869161Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5869496Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:11.5877855Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.5880053Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.5882026Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.5882360Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:11.7145857Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.7147950Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:11.7149925Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:11.7150239Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:11.7158682Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:11.7160711Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7162553Z 2025-05-07T20:32:11.7162678Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7162888Z 2025-05-07T20:32:11.7162999Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7163408Z self=, 2025-05-07T20:32:11.7163808Z T=4096, 2025-05-07T20:32:11.7164005Z D=7168, 2025-05-07T20:32:11.7164192Z scale_ub=None, 2025-05-07T20:32:11.7164408Z contiguous=True, 2025-05-07T20:32:11.7164640Z compiled=False, 2025-05-07T20:32:11.7164840Z ) 2025-05-07T20:32:11.7165168Z self = 2025-05-07T20:32:11.7165662Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.7165926Z 2025-05-07T20:32:11.7166015Z @given( 2025-05-07T20:32:11.7166240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7166551Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7166857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7167179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7167506Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7167794Z ) 2025-05-07T20:32:11.7168136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7168582Z def test_silu_mul_quant( 2025-05-07T20:32:11.7168825Z self, 2025-05-07T20:32:11.7169018Z T: int, 2025-05-07T20:32:11.7169214Z D: int, 2025-05-07T20:32:11.7169434Z scale_ub: Optional[float], 2025-05-07T20:32:11.7169699Z contiguous: bool, 2025-05-07T20:32:11.7169996Z compiled: bool, 2025-05-07T20:32:11.7170221Z ) -> None: 2025-05-07T20:32:11.7170436Z torch.manual_seed(2025) 2025-05-07T20:32:11.7170674Z 2025-05-07T20:32:11.7170946Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7173083Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7174998Z 2025-05-07T20:32:11.7175123Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7175338Z 2025-05-07T20:32:11.7175443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7175858Z self=, 2025-05-07T20:32:11.7176258Z T=16384, 2025-05-07T20:32:11.7176454Z D=7168, 2025-05-07T20:32:11.7176640Z scale_ub=None, 2025-05-07T20:32:11.7176859Z contiguous=True, 2025-05-07T20:32:11.7177085Z compiled=False, 2025-05-07T20:32:11.7177285Z ) 2025-05-07T20:32:11.7177600Z self = 2025-05-07T20:32:11.7178095Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.7178367Z 2025-05-07T20:32:11.7178446Z @given( 2025-05-07T20:32:11.7178684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7178996Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7179296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7179625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7180051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7180333Z ) 2025-05-07T20:32:11.7180676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7181116Z def test_silu_mul_quant( 2025-05-07T20:32:11.7181360Z self, 2025-05-07T20:32:11.7181551Z T: int, 2025-05-07T20:32:11.7181750Z D: int, 2025-05-07T20:32:11.7181972Z scale_ub: Optional[float], 2025-05-07T20:32:11.7182239Z contiguous: bool, 2025-05-07T20:32:11.7182484Z compiled: bool, 2025-05-07T20:32:11.7182706Z ) -> None: 2025-05-07T20:32:11.7182917Z torch.manual_seed(2025) 2025-05-07T20:32:11.7183166Z 2025-05-07T20:32:11.7183436Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7185457Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7187290Z 2025-05-07T20:32:11.7187417Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7187625Z 2025-05-07T20:32:11.7187735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7188144Z self=, 2025-05-07T20:32:11.7188549Z T=16384, 2025-05-07T20:32:11.7188734Z D=7168, 2025-05-07T20:32:11.7188923Z scale_ub=1200.0, 2025-05-07T20:32:11.7189145Z contiguous=True, 2025-05-07T20:32:11.7189359Z compiled=False, 2025-05-07T20:32:11.7189560Z ) 2025-05-07T20:32:11.7190212Z self = 2025-05-07T20:32:11.7190705Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.7190983Z 2025-05-07T20:32:11.7191059Z @given( 2025-05-07T20:32:11.7191286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7191595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7191893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7192221Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7192633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7192911Z ) 2025-05-07T20:32:11.7193369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7193810Z def test_silu_mul_quant( 2025-05-07T20:32:11.7194048Z self, 2025-05-07T20:32:11.7194241Z T: int, 2025-05-07T20:32:11.7194438Z D: int, 2025-05-07T20:32:11.7194656Z scale_ub: Optional[float], 2025-05-07T20:32:11.7194926Z contiguous: bool, 2025-05-07T20:32:11.7195171Z compiled: bool, 2025-05-07T20:32:11.7195393Z ) -> None: 2025-05-07T20:32:11.7195608Z torch.manual_seed(2025) 2025-05-07T20:32:11.7195847Z 2025-05-07T20:32:11.7196122Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7198141Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
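The allocator hint repeated in each of these messages is an environment knob, not a code change. A minimal sketch of applying it, assuming it is set in the test process itself; the caching allocator reads PYTORCH_CUDA_ALLOC_CONF once, when the first CUDA allocation initializes it, so the variable must be set before that point (or simply exported in the shell that launches pytest):

import os

# Must be in the environment before the first CUDA allocation; equivalent to
# running: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True pytest ...
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.zeros(1, device="cuda")  # first CUDA allocation now uses expandable segments

Expandable segments reduce fragmentation-driven OOMs; they do not help here if the pool is genuinely full, as the 22.04 GiB in use above suggests.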
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7199985Z 2025-05-07T20:32:11.7200103Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7200316Z 2025-05-07T20:32:11.7200423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7200831Z self=, 2025-05-07T20:32:11.7201226Z T=128, 2025-05-07T20:32:11.7201406Z D=5120, 2025-05-07T20:32:11.7201599Z scale_ub=1200.0, 2025-05-07T20:32:11.7201823Z contiguous=False, 2025-05-07T20:32:11.7202040Z compiled=False, 2025-05-07T20:32:11.7202244Z ) 2025-05-07T20:32:12.0571707Z self = 2025-05-07T20:32:12.0572479Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.0572881Z 2025-05-07T20:32:12.0573022Z @given( 2025-05-07T20:32:12.0573349Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.0573703Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.0574009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.0574359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.0574691Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.0574974Z ) 2025-05-07T20:32:12.0575326Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.0575764Z def test_silu_mul_quant( 2025-05-07T20:32:12.0576009Z self, 2025-05-07T20:32:12.0576207Z T: int, 2025-05-07T20:32:12.0576409Z D: int, 2025-05-07T20:32:12.0576628Z scale_ub: Optional[float], 2025-05-07T20:32:12.0576907Z contiguous: bool, 2025-05-07T20:32:12.0577148Z compiled: bool, 2025-05-07T20:32:12.0577375Z ) -> None: 2025-05-07T20:32:12.0577604Z torch.manual_seed(2025) 2025-05-07T20:32:12.0577857Z 2025-05-07T20:32:12.0578132Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.0578471Z 2025-05-07T20:32:12.0578666Z x_sign = torch.sign(x) 2025-05-07T20:32:12.0579188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.0579492Z x = x_sign * x_clamp 2025-05-07T20:32:12.0579736Z x0 = x[:, :D] 2025-05-07T20:32:12.0580055Z x1 = x[:, D:] 2025-05-07T20:32:12.0580257Z 2025-05-07T20:32:12.0580447Z if contiguous: 2025-05-07T20:32:12.0580683Z x0 = x0.contiguous() 2025-05-07T20:32:12.0580939Z x1 = x1.contiguous() 2025-05-07T20:32:12.0581178Z 2025-05-07T20:32:12.0581377Z if scale_ub is not None: 2025-05-07T20:32:12.0581742Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.0582082Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.0582518Z ) 2025-05-07T20:32:12.0582711Z else: 2025-05-07T20:32:12.0582927Z scale_ub_tensor = None 2025-05-07T20:32:12.0583179Z 2025-05-07T20:32:12.0583407Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.0583728Z op = silu_mul_quant 2025-05-07T20:32:12.0583982Z if compiled: 2025-05-07T20:32:12.0584233Z op = torch.compile(op) 2025-05-07T20:32:12.0584526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.0584796Z 2025-05-07T20:32:12.0584992Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.0585155Z 2025-05-07T20:32:12.0585259Z moe/activation_test.py:117: 2025-05-07T20:32:12.0585559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.0585894Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.0586176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.0586876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.0587575Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.0588112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.0588787Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.0589449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.0590238Z kernel = self.compile( 2025-05-07T20:32:12.0590779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.0591438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.0591846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.0592076Z 2025-05-07T20:32:12.0592299Z self = 2025-05-07T20:32:12.0593368Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.0594807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aa858940>} 2025-05-07T20:32:12.0596140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.0597174Z context = 2025-05-07T20:32:12.0597462Z 2025-05-07T20:32:12.0597655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.0598180Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.0598644Z module_map=module_map) 2025-05-07T20:32:12.0599013Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.0599443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.0599701Z E ^ 2025-05-07T20:32:12.0600286Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.0600760Z 2025-05-07T20:32:12.0608260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.0608828Z 2025-05-07T20:32:12.0608941Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.0609369Z self=, 2025-05-07T20:32:12.0609902Z T=2048, 2025-05-07T20:32:12.0610096Z D=7168, 2025-05-07T20:32:12.0610408Z scale_ub=None, 2025-05-07T20:32:12.0610641Z contiguous=False, 2025-05-07T20:32:12.0610876Z compiled=False, 2025-05-07T20:32:12.0611094Z ) 2025-05-07T20:32:12.0611422Z self = 2025-05-07T20:32:12.0611925Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.0612207Z 2025-05-07T20:32:12.0612287Z @given( 2025-05-07T20:32:12.0612531Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.0612845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.0613163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.0613507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.0613845Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.0614160Z ) 2025-05-07T20:32:12.0614544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.0615002Z def test_silu_mul_quant( 2025-05-07T20:32:12.0615250Z self, 2025-05-07T20:32:12.0615461Z T: int, 2025-05-07T20:32:12.0615668Z D: int, 2025-05-07T20:32:12.0615890Z scale_ub: Optional[float], 2025-05-07T20:32:12.0616172Z contiguous: bool, 2025-05-07T20:32:12.0616423Z compiled: bool, 2025-05-07T20:32:12.0616653Z ) -> None: 2025-05-07T20:32:12.0616879Z torch.manual_seed(2025) 2025-05-07T20:32:12.0617131Z 2025-05-07T20:32:12.0617411Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.0619479Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
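The CompilationError above is an architecture mismatch rather than a kernel bug: Triton lowers fp8e4nv (FP8 E4M3) only on SM 8.9+ GPUs (Ada/Hopper), while the A10G in a linux.g5.4xlarge runner reports SM 8.6, where only 'fp8e4b15' and 'fp8e5' are available, exactly as the ValueError says. A hedged sketch of a capability guard; the helper name and skip message are assumptions, not FBGEMM's actual gating:

import unittest

import torch

def fp8e4nv_supported() -> bool:
    # Triton's fp8e4nv (E4M3) codegen needs compute capability 8.9 or newer;
    # an A10G reports (8, 6), which is why this job hits the ValueError.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical gating for illustration only.
@unittest.skipIf(not fp8e4nv_supported(), "FP8 E4M3 not supported on this GPU")
class ActivationTests(unittest.TestCase):
    ...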
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.0621423Z 2025-05-07T20:32:12.0621548Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.0621769Z 2025-05-07T20:32:12.0621879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.0622300Z self=, 2025-05-07T20:32:12.0622707Z T=128, 2025-05-07T20:32:12.0622905Z D=7168, 2025-05-07T20:32:12.0623108Z scale_ub=1200.0, 2025-05-07T20:32:12.0623333Z contiguous=True, 2025-05-07T20:32:12.0623569Z compiled=True, 2025-05-07T20:32:12.0623781Z ) 2025-05-07T20:32:12.1039304Z self = 2025-05-07T20:32:12.1040424Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.1041008Z 2025-05-07T20:32:12.1041168Z @given( 2025-05-07T20:32:12.1041658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1042134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1042608Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1043123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1043829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1044175Z ) 2025-05-07T20:32:12.1044564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1045013Z def test_silu_mul_quant( 2025-05-07T20:32:12.1045261Z self, 2025-05-07T20:32:12.1045470Z T: int, 2025-05-07T20:32:12.1045680Z D: int, 2025-05-07T20:32:12.1045905Z scale_ub: Optional[float], 2025-05-07T20:32:12.1046214Z contiguous: bool, 2025-05-07T20:32:12.1046468Z compiled: bool, 2025-05-07T20:32:12.1046785Z ) -> None: 2025-05-07T20:32:12.1047007Z torch.manual_seed(2025) 2025-05-07T20:32:12.1047265Z 2025-05-07T20:32:12.1047660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1048010Z 2025-05-07T20:32:12.1048209Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1048513Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1048818Z x = x_sign * x_clamp 2025-05-07T20:32:12.1049064Z x0 = x[:, :D] 2025-05-07T20:32:12.1049285Z x1 = x[:, D:] 2025-05-07T20:32:12.1049488Z 2025-05-07T20:32:12.1049680Z if contiguous: 2025-05-07T20:32:12.1049918Z x0 = x0.contiguous() 2025-05-07T20:32:12.1050174Z x1 = x1.contiguous() 2025-05-07T20:32:12.1050419Z 2025-05-07T20:32:12.1050619Z if scale_ub is not None: 2025-05-07T20:32:12.1050890Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.1051226Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.1051539Z ) 2025-05-07T20:32:12.1051733Z else: 2025-05-07T20:32:12.1051947Z scale_ub_tensor = None 2025-05-07T20:32:12.1052210Z 2025-05-07T20:32:12.1052441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1052755Z op = silu_mul_quant 2025-05-07T20:32:12.1053014Z if compiled: 2025-05-07T20:32:12.1053265Z op = torch.compile(op) 2025-05-07T20:32:12.1053564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1053840Z 2025-05-07T20:32:12.1054043Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.1054206Z 2025-05-07T20:32:12.1054307Z moe/activation_test.py:117: 2025-05-07T20:32:12.1054610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1054943Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.1055221Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1055780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.1056342Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.1057009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.1057690Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.1058224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.1058909Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.1059563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1060235Z kernel = self.compile( 2025-05-07T20:32:12.1060778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1061432Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1061825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1062064Z 2025-05-07T20:32:12.1062274Z self = 2025-05-07T20:32:12.1063363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1064807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aa858dc0>} 2025-05-07T20:32:12.1066132Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1067210Z context = 2025-05-07T20:32:12.1067503Z 2025-05-07T20:32:12.1067743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1068270Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1068732Z module_map=module_map) 2025-05-07T20:32:12.1069107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1069462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.1069723Z E ^ 2025-05-07T20:32:12.1070182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.1070630Z 2025-05-07T20:32:12.1071044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.1071550Z 2025-05-07T20:32:12.1071660Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1072071Z self=, 2025-05-07T20:32:12.1072475Z T=128, 2025-05-07T20:32:12.1072671Z D=7168, 2025-05-07T20:32:12.1072870Z scale_ub=1200.0, 2025-05-07T20:32:12.1073091Z contiguous=True, 2025-05-07T20:32:12.1073319Z compiled=False, 2025-05-07T20:32:12.1073526Z ) 2025-05-07T20:32:12.1073842Z self = 2025-05-07T20:32:12.1074342Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.1074611Z 2025-05-07T20:32:12.1074699Z @given( 2025-05-07T20:32:12.1074931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1075247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1075562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1075885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1076220Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1076517Z ) 2025-05-07T20:32:12.1076877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1077314Z def test_silu_mul_quant( 2025-05-07T20:32:12.1077558Z self, 2025-05-07T20:32:12.1077756Z T: int, 2025-05-07T20:32:12.1077950Z D: int, 2025-05-07T20:32:12.1078176Z scale_ub: Optional[float], 2025-05-07T20:32:12.1078454Z contiguous: bool, 2025-05-07T20:32:12.1078690Z compiled: bool, 2025-05-07T20:32:12.1078924Z ) -> None: 2025-05-07T20:32:12.1079143Z torch.manual_seed(2025) 2025-05-07T20:32:12.1079383Z 2025-05-07T20:32:12.1079658Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1080002Z 2025-05-07T20:32:12.1080191Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1080487Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1082487Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.1084429Z 2025-05-07T20:32:12.1084553Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:12.1084765Z 2025-05-07T20:32:12.1084878Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1085288Z self=, 2025-05-07T20:32:12.1085690Z T=128, 2025-05-07T20:32:12.1085884Z D=5120, 2025-05-07T20:32:12.1086073Z scale_ub=1200.0, 2025-05-07T20:32:12.1086296Z contiguous=True, 2025-05-07T20:32:12.1086565Z compiled=True, 2025-05-07T20:32:12.1086768Z ) 2025-05-07T20:32:12.1087196Z self = 2025-05-07T20:32:12.1087691Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.1087956Z 2025-05-07T20:32:12.1088039Z @given( 2025-05-07T20:32:12.1088269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1088588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1088895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1089222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1089556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1090173Z ) 2025-05-07T20:32:12.1090530Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1090972Z def test_silu_mul_quant( 2025-05-07T20:32:12.1091222Z self, 2025-05-07T20:32:12.1091428Z T: int, 2025-05-07T20:32:12.1091623Z D: int, 2025-05-07T20:32:12.1091848Z scale_ub: Optional[float], 2025-05-07T20:32:12.1092128Z contiguous: bool, 2025-05-07T20:32:12.1092367Z compiled: bool, 2025-05-07T20:32:12.1092597Z ) -> None: 2025-05-07T20:32:12.1092818Z torch.manual_seed(2025) 2025-05-07T20:32:12.1093056Z 2025-05-07T20:32:12.1093335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1093683Z 2025-05-07T20:32:12.1093877Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1094173Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1096208Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.1098032Z 2025-05-07T20:32:12.1098160Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:12.1098396Z 2025-05-07T20:32:12.1098510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1098927Z self=, 2025-05-07T20:32:12.1099328Z T=128, 2025-05-07T20:32:12.1099521Z D=7168, 2025-05-07T20:32:12.1099718Z scale_ub=None, 2025-05-07T20:32:12.1100005Z contiguous=True, 2025-05-07T20:32:12.1100232Z compiled=True, 2025-05-07T20:32:12.1100436Z ) 2025-05-07T20:32:12.3543886Z self = 2025-05-07T20:32:12.3544579Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.3544871Z 2025-05-07T20:32:12.3544951Z @given( 2025-05-07T20:32:12.3545190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3545511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3545822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3546157Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3546481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3547039Z ) 2025-05-07T20:32:12.3547391Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3547824Z def test_silu_mul_quant( 2025-05-07T20:32:12.3548071Z self, 2025-05-07T20:32:12.3548267Z T: int, 2025-05-07T20:32:12.3548462Z D: int, 2025-05-07T20:32:12.3548687Z scale_ub: Optional[float], 2025-05-07T20:32:12.3548963Z contiguous: bool, 2025-05-07T20:32:12.3549208Z compiled: bool, 2025-05-07T20:32:12.3549430Z ) -> None: 2025-05-07T20:32:12.3549739Z torch.manual_seed(2025) 2025-05-07T20:32:12.3549982Z 2025-05-07T20:32:12.3550381Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3552410Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
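Also worth noting in the examples above: the reported free memory has shrunk from 26.44 MiB to 4.44 MiB, and the OOM has moved from the randn at activation_test.py:92 to the 20 MiB clamp temporary at line 95, which is consistent with allocations from earlier Hypothesis examples still occupying the pool. Because Hypothesis generates all of its examples inside a single unittest test invocation, per-example cleanup cannot live in setUp/tearDown; a minimal sketch of a helper the test body could call first (the name and placement are assumptions):

import gc

import torch

def reset_cuda_pool() -> None:
    # Hypothesis re-enters the test body once per generated example, while
    # unittest's setUp/tearDown run only once around the whole @given test,
    # so per-example cleanup has to be invoked from inside the body itself.
    gc.collect()              # release Python references from the last example
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver

Calling reset_cuda_pool() as the first statement of test_silu_mul_quant would give each generated example a clean allocator pool.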
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.3554254Z 2025-05-07T20:32:12.3554375Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.3554589Z 2025-05-07T20:32:12.3586340Z FAILED 2025-05-07T20:32:12.3586469Z 2025-05-07T20:32:12.3586646Z =================================== FAILURES =================================== 2025-05-07T20:32:12.3587274Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:12.3587900Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:12.3588751Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:12.3589516Z | yield 2025-05-07T20:32:12.3590452Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:12.3591191Z | self._callTestMethod(testMethod) 2025-05-07T20:32:12.3591972Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:12.3592715Z | method() 2025-05-07T20:32:12.3593590Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:12.3594612Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3595506Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:12.3596353Z | raise the_error_hypothesis_found 2025-05-07T20:32:12.3597037Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:12.3597706Z +-+---------------- 1 ---------------- 2025-05-07T20:32:12.3598115Z | Traceback (most recent call last): 2025-05-07T20:32:12.3599083Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:12.3600165Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3603021Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.3605935Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:12.3606543Z | self=, 2025-05-07T20:32:12.3607104Z | T=2048, 2025-05-07T20:32:12.3607418Z | D=5120, # or any other generated value 2025-05-07T20:32:12.3607880Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:12.3608372Z | contiguous=True, # or any other generated value 2025-05-07T20:32:12.3608874Z | compiled=False, # or any other generated value 2025-05-07T20:32:12.3609381Z | ) 2025-05-07T20:32:12.3609621Z | 2025-05-07T20:32:12.3610475Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:12.3611327Z +---------------- 2 ---------------- 2025-05-07T20:32:12.3611742Z | Traceback (most recent call last): 2025-05-07T20:32:12.3612755Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:12.3613844Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3616700Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.3618756Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:12.3620169Z | self=, 2025-05-07T20:32:12.3620747Z | T=128, 2025-05-07T20:32:12.3621016Z | D=7168, 2025-05-07T20:32:12.3621300Z | scale_ub=None, 2025-05-07T20:32:12.3621631Z | contiguous=True, 2025-05-07T20:32:12.3621957Z | compiled=True, 2025-05-07T20:32:12.3622262Z | ) 2025-05-07T20:32:12.3622506Z | 2025-05-07T20:32:12.3623225Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:12.3623876Z +---------------- 3 ---------------- 2025-05-07T20:32:12.3624177Z | Traceback (most recent call last): 2025-05-07T20:32:12.3624890Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:12.3625660Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3627697Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
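Each "You can reproduce this example..." note in these sub-exceptions is directly actionable. A sketch of the suggested decorator placement for failure 1, with the version string and payload copied verbatim from the log; the _MAX_SAMPLES value is an assumed stand-in for the real constant in activation_test.py, and self is dropped to keep the sketch standalone:

from typing import Optional

from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

_MAX_SAMPLES = 10  # assumption; the real value lives in activation_test.py

# Temporarily stacked above the test's existing decorators.
@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
def test_silu_mul_quant(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    ...  # test body unchanged

With the decorator in place, Hypothesis replays exactly that falsifying example instead of searching; it is meant to be removed once the failure is fixed.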
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.3629692Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:12.3630140Z | self=, 2025-05-07T20:32:12.3630542Z | T=128, 2025-05-07T20:32:12.3630747Z | D=5120, 2025-05-07T20:32:12.3630962Z | scale_ub=1200.0, 2025-05-07T20:32:12.3631210Z | contiguous=True, 2025-05-07T20:32:12.3631446Z | compiled=True, 2025-05-07T20:32:12.3631676Z | ) 2025-05-07T20:32:12.3631934Z | 2025-05-07T20:32:12.3632452Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:12.3633059Z +---------------- 4 ---------------- 2025-05-07T20:32:12.3633352Z | Traceback (most recent call last): 2025-05-07T20:32:12.3634053Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:12.3634759Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.3635531Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:12.3636229Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3637065Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:12.3637864Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.3638474Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:12.3639202Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3639934Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:12.3640702Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3641506Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:12.3642302Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3643071Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:12.3643760Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.3644409Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:12.3644969Z | fn() 2025-05-07T20:32:12.3645530Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:12.3646300Z | self.fn.run( 2025-05-07T20:32:12.3647037Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:12.3647830Z | kernel = self.compile( 2025-05-07T20:32:12.3648663Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:12.3649650Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3650639Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:12.3651723Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3652445Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3652934Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.3653305Z | ^ 2025-05-07T20:32:12.3653934Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3654725Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:12.3655284Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:12.3656001Z | self=, 2025-05-07T20:32:12.3656678Z | T=1, # or any other generated value 2025-05-07T20:32:12.3657120Z | D=5120, # or any other generated value 2025-05-07T20:32:12.3657595Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:12.3658099Z | contiguous=True, # or any other generated value 2025-05-07T20:32:12.3658624Z | compiled=True, # or any other generated value 2025-05-07T20:32:12.3659038Z | ) 2025-05-07T20:32:12.3659298Z | 2025-05-07T20:32:12.3660190Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:12.3681615Z +------------------------------------ 2025-05-07T20:32:12.3682137Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:12.3682654Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3683237Z self=, 2025-05-07T20:32:12.3683807Z T=1, 2025-05-07T20:32:12.3684068Z D=5120, 2025-05-07T20:32:12.3684345Z scale_ub=None, 2025-05-07T20:32:12.3684657Z contiguous=True, 2025-05-07T20:32:12.3684975Z compiled=True, 2025-05-07T20:32:12.3685277Z ) 2025-05-07T20:32:12.3685735Z self = 2025-05-07T20:32:12.3686400Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.3686766Z 2025-05-07T20:32:12.3686876Z @given( 2025-05-07T20:32:12.3687199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3687632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3688060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3688503Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3688953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3689332Z ) 2025-05-07T20:32:12.3689805Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3690715Z def test_silu_mul_quant( 2025-05-07T20:32:12.3691040Z self, 2025-05-07T20:32:12.3691307Z T: int, 2025-05-07T20:32:12.3691580Z D: int, 2025-05-07T20:32:12.3691867Z scale_ub: Optional[float], 2025-05-07T20:32:12.3692238Z contiguous: bool, 2025-05-07T20:32:12.3692562Z compiled: bool, 2025-05-07T20:32:12.3692859Z ) -> None: 2025-05-07T20:32:12.3693148Z torch.manual_seed(2025) 2025-05-07T20:32:12.3693480Z 2025-05-07T20:32:12.3693872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3694390Z 2025-05-07T20:32:12.3694658Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3695041Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3695444Z x = x_sign * x_clamp 2025-05-07T20:32:12.3695784Z x0 = x[:, :D] 2025-05-07T20:32:12.3696094Z x1 = x[:, D:] 2025-05-07T20:32:12.3696385Z 2025-05-07T20:32:12.3696646Z if contiguous: 2025-05-07T20:32:12.3696967Z x0 = x0.contiguous() 
2025-05-07T20:32:12.3697308Z x1 = x1.contiguous() 2025-05-07T20:32:12.3697632Z 2025-05-07T20:32:12.3697894Z if scale_ub is not None: 2025-05-07T20:32:12.3698263Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3698733Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3699165Z ) 2025-05-07T20:32:12.3699426Z else: 2025-05-07T20:32:12.3699709Z scale_ub_tensor = None 2025-05-07T20:32:12.3700190Z 2025-05-07T20:32:12.3700506Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3700950Z op = silu_mul_quant 2025-05-07T20:32:12.3701309Z if compiled: 2025-05-07T20:32:12.3701670Z op = torch.compile(op) 2025-05-07T20:32:12.3702080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3702481Z 2025-05-07T20:32:12.3702947Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.3703336Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.3703738Z 2025-05-07T20:32:12.3704063Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3704495Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.3704886Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.3705302Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.3705774Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3706282Z 2025-05-07T20:32:12.3706557Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.3706815Z 2025-05-07T20:32:12.3707093Z moe/activation_test.py:126: 2025-05-07T20:32:12.3707483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3707934Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.3708395Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3709453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.3710473Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.3711189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3712098Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3713058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.3714084Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3715133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.3716155Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3717152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.3718034Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.3718847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.3719568Z fn() 2025-05-07T20:32:12.3720270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.3721071Z self.fn.run( 2025-05-07T20:32:12.3721714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3722432Z kernel = self.compile( 2025-05-07T20:32:12.3723130Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3724018Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3724567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3724888Z 2025-05-07T20:32:12.3725178Z self = 2025-05-07T20:32:12.3726637Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3728502Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07cfc6f400>} 2025-05-07T20:32:12.3730394Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3731880Z context = 2025-05-07T20:32:12.3732290Z 2025-05-07T20:32:12.3732537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3733274Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3733951Z module_map=module_map) 2025-05-07T20:32:12.3734453Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3734920Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.3735345Z E ^ 2025-05-07T20:32:12.3736061Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3736675Z 2025-05-07T20:32:12.3737240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3737932Z 2025-05-07T20:32:12.3738069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3738657Z self=, 2025-05-07T20:32:12.3739226Z T=2048, 2025-05-07T20:32:12.3739485Z D=5120, 2025-05-07T20:32:12.3739759Z scale_ub=1200.0, 2025-05-07T20:32:12.3740198Z contiguous=True, 2025-05-07T20:32:12.3740500Z compiled=False, 2025-05-07T20:32:12.3740769Z ) 2025-05-07T20:32:12.3741185Z self = 2025-05-07T20:32:12.3741820Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.3742183Z 2025-05-07T20:32:12.3742285Z @given( 2025-05-07T20:32:12.3742589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3743014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3743409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3743847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3744286Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3744668Z ) 2025-05-07T20:32:12.3745143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3745727Z def test_silu_mul_quant( 2025-05-07T20:32:12.3746057Z self, 2025-05-07T20:32:12.3746306Z T: int, 2025-05-07T20:32:12.3746568Z D: int, 2025-05-07T20:32:12.3746876Z scale_ub: Optional[float], 2025-05-07T20:32:12.3747257Z contiguous: bool, 2025-05-07T20:32:12.3747601Z compiled: bool, 2025-05-07T20:32:12.3747923Z ) -> None: 2025-05-07T20:32:12.3748224Z torch.manual_seed(2025) 2025-05-07T20:32:12.3748571Z 2025-05-07T20:32:12.3748958Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3749430Z 2025-05-07T20:32:12.3749702Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3750114Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3750544Z x = x_sign * x_clamp 2025-05-07T20:32:12.3750891Z x0 = x[:, :D] 
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
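Note: the repeated ValueError comes from Triton's NVIDIA backend. The fp8e4nv type (PyTorch's torch.float8_e4m3fn) is, to our understanding, only lowered on GPUs with compute capability 8.9 or newer (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G (sm_86), where Triton offers only fp8e4b15 and fp8e5. A minimal sketch of a capability guard such a test could use; the helper name and skip wiring here are illustrative, not FBGEMM's actual API:

import pytest
import torch

def supports_fp8_e4m3() -> bool:
    # fp8e4nv corresponds to torch.float8_e4m3fn; Triton's NVIDIA backend
    # accepts it only on compute capability >= (8, 9) (Ada/Hopper).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical marker (not present in activation_test.py) that would skip
# these cases on pre-sm_89 GPUs such as the A10G in this job:
requires_fp8_e4m3 = pytest.mark.skipif(
    not supports_fp8_e4m3(), reason="fp8e4nv requires sm_89 or newer"
)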
Hypothesis then retries further examples, and every one fails with the same
ValueError("type fp8e4nv not supported in this architecture. The supported fp8
dtypes are ('fp8e4b15', 'fp8e5')"); only the kernel that trips it first differs.
With compiled=False the eager path fails while compiling _fbgemm_silu_mul_quant
(fbgemm_gpu/experimental/gen_ai/moe/activation.py:80); with compiled=True the
failure surfaces in the reference path instead, while compiling
_kernel_quantize_fp8_row (triton_gemm/fp8_gemm.py:2370 via
triton_quantize_fp8_row). The source listing and tracebacks repeat verbatim for
each example and are identical to the two shown above. Condensed from the log:

Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True  -> _kernel_quantize_fp8_row
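For reading the test: triton_quantize_fp8_row, which both reference-side failures funnel into, performs row-wise dynamic fp8 quantization. A rough pure-PyTorch sketch of the math, assuming the conventional formulation (per-row absolute max mapped to the fp8 range, optionally clamped from above by scale_ub); this is an approximation for orientation, not the FBGEMM kernel:

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute max, optionally clamped from above by scale_ub.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    # Scale each row so its max maps to the fp8 e4m3 maximum (448.0).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    # Dequantize as y_fp8.float() * scale[:, None], matching the test's
    # y = y_fp8.to(torch.float32) * y_scale[:, None].
    return y_fp8, scale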
The remaining drawn examples fail the same way:

Trying example: T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row

The log breaks off mid-traceback of the last example, inside the autotuner's
compile call for _kernel_quantize_fp8_row, before its CompilationError is
printed.
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4063056Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bd48bc70>} 2025-05-07T20:32:12.4063801Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4063999Z context = 2025-05-07T20:32:12.4064003Z 2025-05-07T20:32:12.4064176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4064439Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4064554Z module_map=module_map) 2025-05-07T20:32:12.4064725Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4064829Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4064906Z E ^ 2025-05-07T20:32:12.4065265Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4065270Z 2025-05-07T20:32:12.4065681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4065687Z 2025-05-07T20:32:12.4065799Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4066029Z self=, 2025-05-07T20:32:12.4066109Z T=4096, 2025-05-07T20:32:12.4066189Z D=5120, 2025-05-07T20:32:12.4066272Z scale_ub=None, 2025-05-07T20:32:12.4066358Z contiguous=True, 2025-05-07T20:32:12.4066542Z compiled=True, 2025-05-07T20:32:12.4066615Z ) 2025-05-07T20:32:12.4066836Z self = 2025-05-07T20:32:12.4067008Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.4067013Z 2025-05-07T20:32:12.4067090Z @given( 2025-05-07T20:32:12.4067216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4067318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4067437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4067608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4067726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4067902Z ) 2025-05-07T20:32:12.4068156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4068252Z def test_silu_mul_quant( 2025-05-07T20:32:12.4068338Z self, 2025-05-07T20:32:12.4068418Z T: int, 2025-05-07T20:32:12.4068496Z D: int, 2025-05-07T20:32:12.4068604Z scale_ub: Optional[float], 2025-05-07T20:32:12.4068694Z contiguous: bool, 2025-05-07T20:32:12.4068781Z compiled: bool, 2025-05-07T20:32:12.4068864Z ) -> None: 2025-05-07T20:32:12.4068959Z torch.manual_seed(2025) 2025-05-07T20:32:12.4069031Z 2025-05-07T20:32:12.4069205Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4069281Z 2025-05-07T20:32:12.4069376Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4069515Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4069607Z x = x_sign * x_clamp 2025-05-07T20:32:12.4069700Z x0 = x[:, :D] 2025-05-07T20:32:12.4069783Z x1 = x[:, D:] 2025-05-07T20:32:12.4069859Z 2025-05-07T20:32:12.4069949Z if contiguous: 2025-05-07T20:32:12.4070043Z x0 = x0.contiguous() 2025-05-07T20:32:12.4070135Z x1 = x1.contiguous() 2025-05-07T20:32:12.4070221Z 2025-05-07T20:32:12.4070312Z if scale_ub is not None: 2025-05-07T20:32:12.4070421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4070561Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4070639Z ) 2025-05-07T20:32:12.4070716Z else: 2025-05-07T20:32:12.4070820Z scale_ub_tensor 
= None 2025-05-07T20:32:12.4070894Z 2025-05-07T20:32:12.4071026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4071125Z op = silu_mul_quant 2025-05-07T20:32:12.4071214Z if compiled: 2025-05-07T20:32:12.4071322Z op = torch.compile(op) 2025-05-07T20:32:12.4071439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4071512Z 2025-05-07T20:32:12.4071615Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.4071738Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.4071812Z 2025-05-07T20:32:12.4071958Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4072062Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.4072165Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.4072297Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.4072441Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4072522Z 2025-05-07T20:32:12.4072626Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.4072631Z 2025-05-07T20:32:12.4072732Z moe/activation_test.py:126: 2025-05-07T20:32:12.4072872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4072985Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.4073122Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4073683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.4073839Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.4074212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4074435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4074800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.4075062Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4075508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.4075839Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4076221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.4076393Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.4076738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.4076817Z fn() 2025-05-07T20:32:12.4077217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.4077307Z self.fn.run( 2025-05-07T20:32:12.4077642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4077761Z kernel = self.compile( 2025-05-07T20:32:12.4078154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4078337Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4078466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4078470Z 2025-05-07T20:32:12.4078683Z self = 2025-05-07T20:32:12.4079477Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4079975Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bcfa4940>} 2025-05-07T20:32:12.4080744Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4080940Z context = 2025-05-07T20:32:12.4080945Z 2025-05-07T20:32:12.4081110Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4081384Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4081494Z module_map=module_map) 2025-05-07T20:32:12.4081668Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4081772Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4081850Z E ^ 2025-05-07T20:32:12.4082208Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4082216Z 2025-05-07T20:32:12.4082632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4082637Z 2025-05-07T20:32:12.4082749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4082969Z self=, 2025-05-07T20:32:12.4083050Z T=16384, 2025-05-07T20:32:12.4083181Z D=5120, 2025-05-07T20:32:12.4083266Z scale_ub=None, 2025-05-07T20:32:12.4083353Z contiguous=True, 2025-05-07T20:32:12.4083446Z compiled=True, 2025-05-07T20:32:12.4083519Z ) 2025-05-07T20:32:12.4083735Z self = 2025-05-07T20:32:12.4083917Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.4083921Z 2025-05-07T20:32:12.4084000Z @given( 2025-05-07T20:32:12.4084127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4084279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4084398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4084627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4084745Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4084820Z ) 2025-05-07T20:32:12.4085072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4085171Z def test_silu_mul_quant( 2025-05-07T20:32:12.4085250Z self, 2025-05-07T20:32:12.4085337Z T: int, 2025-05-07T20:32:12.4085414Z D: int, 2025-05-07T20:32:12.4085514Z scale_ub: Optional[float], 2025-05-07T20:32:12.4085610Z contiguous: bool, 2025-05-07T20:32:12.4085697Z compiled: bool, 2025-05-07T20:32:12.4085783Z ) -> None: 2025-05-07T20:32:12.4085878Z torch.manual_seed(2025) 2025-05-07T20:32:12.4085951Z 2025-05-07T20:32:12.4086124Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4086207Z 2025-05-07T20:32:12.4086305Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4086442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4086532Z x = x_sign * x_clamp 2025-05-07T20:32:12.4086617Z x0 = x[:, :D] 2025-05-07T20:32:12.4086706Z x1 = x[:, D:] 2025-05-07T20:32:12.4086778Z 2025-05-07T20:32:12.4086862Z if contiguous: 2025-05-07T20:32:12.4086965Z x0 = x0.contiguous() 2025-05-07T20:32:12.4087055Z x1 = x1.contiguous() 2025-05-07T20:32:12.4087126Z 2025-05-07T20:32:12.4087226Z if scale_ub is not None: 2025-05-07T20:32:12.4087336Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4087478Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:12.4087557Z ) 2025-05-07T20:32:12.4087634Z else: 2025-05-07T20:32:12.4087737Z scale_ub_tensor = None 2025-05-07T20:32:12.4087811Z 2025-05-07T20:32:12.4087947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4088045Z op = silu_mul_quant 2025-05-07T20:32:12.4088139Z if compiled: 2025-05-07T20:32:12.4088243Z op = torch.compile(op) 2025-05-07T20:32:12.4088358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4088432Z 2025-05-07T20:32:12.4088524Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.4088656Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.4088729Z 2025-05-07T20:32:12.4088875Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4088976Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.4089078Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.4089209Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.4089350Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4089425Z 2025-05-07T20:32:12.4089531Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:12.4089539Z 2025-05-07T20:32:12.4089640Z moe/activation_test.py:126: 2025-05-07T20:32:12.4089783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4090186Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.4090386Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4090984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.4091247Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.4091604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4091836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4092200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.4092541Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4093052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.4093306Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4093696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.4093868Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.4094217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.4094298Z fn() 2025-05-07T20:32:12.4094698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.4094787Z self.fn.run( 2025-05-07T20:32:12.4095134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4095235Z kernel = self.compile( 2025-05-07T20:32:12.4095620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4095797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4095941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:12.4095946Z 2025-05-07T20:32:12.4096153Z self = 2025-05-07T20:32:12.4096925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4097430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bd48a9e0>} 2025-05-07T20:32:12.4098180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4098380Z context = 2025-05-07T20:32:12.4098386Z 2025-05-07T20:32:12.4098551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4098814Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4098930Z module_map=module_map) 2025-05-07T20:32:12.4099094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4099207Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4099286Z E ^ 2025-05-07T20:32:12.4099639Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4099646Z 2025-05-07T20:32:12.4100179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4100184Z 2025-05-07T20:32:12.4100288Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4100517Z self=, 2025-05-07T20:32:12.4100648Z T=1, 2025-05-07T20:32:12.4100725Z D=5120, 2025-05-07T20:32:12.4100817Z scale_ub=1200.0, 2025-05-07T20:32:12.4100905Z contiguous=True, 2025-05-07T20:32:12.4100989Z compiled=True, 2025-05-07T20:32:12.4101068Z ) 2025-05-07T20:32:12.4101291Z self = 2025-05-07T20:32:12.4101457Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.4101462Z 2025-05-07T20:32:12.4101547Z @given( 2025-05-07T20:32:12.4101710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4101815Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4102007Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4102128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4102248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4102323Z ) 2025-05-07T20:32:12.4102576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4102679Z def test_silu_mul_quant( 2025-05-07T20:32:12.4102757Z self, 2025-05-07T20:32:12.4102833Z T: int, 2025-05-07T20:32:12.4102917Z D: int, 2025-05-07T20:32:12.4103016Z scale_ub: Optional[float], 2025-05-07T20:32:12.4103107Z contiguous: bool, 2025-05-07T20:32:12.4103199Z compiled: bool, 2025-05-07T20:32:12.4103277Z ) -> None: 2025-05-07T20:32:12.4103378Z torch.manual_seed(2025) 2025-05-07T20:32:12.4103458Z 2025-05-07T20:32:12.4103630Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4103709Z 2025-05-07T20:32:12.4103812Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4103938Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4104034Z x = x_sign * x_clamp 2025-05-07T20:32:12.4104113Z x0 = x[:, :D] 2025-05-07T20:32:12.4104199Z x1 = x[:, D:] 2025-05-07T20:32:12.4104277Z 2025-05-07T20:32:12.4104360Z if contiguous: 2025-05-07T20:32:12.4104454Z x0 = x0.contiguous() 2025-05-07T20:32:12.4104554Z x1 = x1.contiguous() 2025-05-07T20:32:12.4104625Z 2025-05-07T20:32:12.4104716Z if scale_ub is not None: 2025-05-07T20:32:12.4104829Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:12.4104966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4105047Z ) 2025-05-07T20:32:12.4105125Z else: 2025-05-07T20:32:12.4105226Z scale_ub_tensor = None 2025-05-07T20:32:12.4105304Z 2025-05-07T20:32:12.4105441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4105535Z op = silu_mul_quant 2025-05-07T20:32:12.4105624Z if compiled: 2025-05-07T20:32:12.4105724Z op = torch.compile(op) 2025-05-07T20:32:12.4105834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4105913Z 2025-05-07T20:32:12.4106005Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4106010Z 2025-05-07T20:32:12.4106116Z moe/activation_test.py:117: 2025-05-07T20:32:12.4106244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4106347Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4106453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4106821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4106921Z return fn(*args, **kwargs) 2025-05-07T20:32:12.4107424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4107525Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4107891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4108165Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4108512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4108613Z kernel = self.compile( 2025-05-07T20:32:12.4108995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4109174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4109306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4109352Z 2025-05-07T20:32:12.4109631Z self = 2025-05-07T20:32:12.4110411Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4110921Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bcfa68c0>} 2025-05-07T20:32:12.4111669Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4111860Z context = 2025-05-07T20:32:12.4111868Z 2025-05-07T20:32:12.4112035Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4112313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4112423Z module_map=module_map) 2025-05-07T20:32:12.4112587Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4112694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4112775Z E ^ 2025-05-07T20:32:12.4113133Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4113138Z 2025-05-07T20:32:12.4113549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4113554Z 2025-05-07T20:32:12.4113660Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4113888Z self=, 2025-05-07T20:32:12.4113969Z T=1, 2025-05-07T20:32:12.4114057Z D=5120, 2025-05-07T20:32:12.4114144Z scale_ub=None, 2025-05-07T20:32:12.4114238Z contiguous=False, 2025-05-07T20:32:12.4114327Z compiled=True, 2025-05-07T20:32:12.4114401Z ) 2025-05-07T20:32:12.4114622Z self = 2025-05-07T20:32:12.4114796Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4114803Z 2025-05-07T20:32:12.4114881Z @given( 2025-05-07T20:32:12.4115002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4115112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4115234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4115361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4115478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4115553Z ) 2025-05-07T20:32:12.4115803Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4115901Z def test_silu_mul_quant( 2025-05-07T20:32:12.4115978Z self, 2025-05-07T20:32:12.4116072Z T: int, 2025-05-07T20:32:12.4116150Z D: int, 2025-05-07T20:32:12.4116252Z scale_ub: Optional[float], 2025-05-07T20:32:12.4116348Z contiguous: bool, 2025-05-07T20:32:12.4116433Z compiled: bool, 2025-05-07T20:32:12.4116566Z ) -> None: 2025-05-07T20:32:12.4116668Z torch.manual_seed(2025) 2025-05-07T20:32:12.4116738Z 2025-05-07T20:32:12.4116918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4116993Z 2025-05-07T20:32:12.4117086Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4117217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4117307Z x = x_sign * x_clamp 2025-05-07T20:32:12.4117389Z x0 = x[:, :D] 2025-05-07T20:32:12.4117475Z x1 = x[:, D:] 2025-05-07T20:32:12.4117612Z 2025-05-07T20:32:12.4117698Z if contiguous: 2025-05-07T20:32:12.4117799Z x0 = x0.contiguous() 2025-05-07T20:32:12.4117966Z x1 = x1.contiguous() 2025-05-07T20:32:12.4118038Z 2025-05-07T20:32:12.4118139Z if scale_ub is not None: 2025-05-07T20:32:12.4118246Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4118381Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4118468Z ) 2025-05-07T20:32:12.4118544Z else: 2025-05-07T20:32:12.4118649Z scale_ub_tensor = None 2025-05-07T20:32:12.4118721Z 2025-05-07T20:32:12.4118852Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4118949Z op = silu_mul_quant 2025-05-07T20:32:12.4119037Z if compiled: 2025-05-07T20:32:12.4119139Z op = torch.compile(op) 2025-05-07T20:32:12.4119253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4119328Z 2025-05-07T20:32:12.4119425Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.4119557Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.4119631Z 2025-05-07T20:32:12.4119777Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4119889Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.4119991Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.4120121Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.4120266Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4120340Z 2025-05-07T20:32:12.4120452Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:12.4120456Z 2025-05-07T20:32:12.4120558Z moe/activation_test.py:126: 2025-05-07T20:32:12.4120685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4120798Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.4120934Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4121502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.4121606Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.4121963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4122191Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4122558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.4122821Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4123222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.4123474Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4123868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.4124039Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.4124378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.4124459Z fn() 2025-05-07T20:32:12.4124914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.4125004Z self.fn.run( 2025-05-07T20:32:12.4125340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4125436Z kernel = self.compile( 2025-05-07T20:32:12.4125822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4125998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4126169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4126254Z 2025-05-07T20:32:12.4126470Z self = 2025-05-07T20:32:12.4127244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4127758Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f07bc8a7880>} 2025-05-07T20:32:12.4128499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4128703Z context = 2025-05-07T20:32:12.4128708Z 2025-05-07T20:32:12.4128880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4129148Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4129264Z module_map=module_map) 2025-05-07T20:32:12.4129429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4129534Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4129616Z E ^ 2025-05-07T20:32:12.4129968Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4129973Z 2025-05-07T20:32:12.4130390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4130394Z 2025-05-07T20:32:12.4130500Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4130722Z self=, 2025-05-07T20:32:12.4130809Z T=1, 2025-05-07T20:32:12.4130892Z D=5120, 2025-05-07T20:32:12.4130983Z scale_ub=None, 2025-05-07T20:32:12.4131070Z contiguous=True, 2025-05-07T20:32:12.4131155Z compiled=False, 2025-05-07T20:32:12.4131234Z ) 2025-05-07T20:32:12.4131450Z self = 2025-05-07T20:32:12.4131620Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:12.4131624Z 2025-05-07T20:32:12.4131710Z @given( 2025-05-07T20:32:12.4131832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4131934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4132057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4132177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4132302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4132379Z ) 2025-05-07T20:32:12.4132628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4132732Z def test_silu_mul_quant( 2025-05-07T20:32:12.4132810Z self, 2025-05-07T20:32:12.4132889Z T: int, 2025-05-07T20:32:12.4132973Z D: int, 2025-05-07T20:32:12.4133074Z scale_ub: Optional[float], 2025-05-07T20:32:12.4133216Z contiguous: bool, 2025-05-07T20:32:12.4133308Z compiled: bool, 2025-05-07T20:32:12.4133384Z ) -> None: 2025-05-07T20:32:12.4133481Z torch.manual_seed(2025) 2025-05-07T20:32:12.4133560Z 2025-05-07T20:32:12.4133729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4133815Z 2025-05-07T20:32:12.4133908Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4134037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4134138Z x = x_sign * x_clamp 2025-05-07T20:32:12.4134285Z x0 = x[:, :D] 2025-05-07T20:32:12.4134371Z x1 = x[:, D:] 2025-05-07T20:32:12.4134471Z 2025-05-07T20:32:12.4134638Z if contiguous: 2025-05-07T20:32:12.4134740Z x0 = x0.contiguous() 2025-05-07T20:32:12.4134840Z x1 = x1.contiguous() 2025-05-07T20:32:12.4134913Z 2025-05-07T20:32:12.4135005Z if scale_ub is not None: 2025-05-07T20:32:12.4135119Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4135259Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4135339Z ) 2025-05-07T20:32:12.4135422Z else: 2025-05-07T20:32:12.4135519Z scale_ub_tensor = None 2025-05-07T20:32:12.4135599Z 2025-05-07T20:32:12.4135728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4135820Z op = silu_mul_quant 2025-05-07T20:32:12.4135917Z if compiled: 2025-05-07T20:32:12.4136020Z 
op = torch.compile(op) 2025-05-07T20:32:12.4136134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4136218Z 2025-05-07T20:32:12.4136310Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4136320Z 2025-05-07T20:32:12.4136421Z moe/activation_test.py:117: 2025-05-07T20:32:12.4136559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4136662Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4136772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4137267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4137367Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4137731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4137954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4138292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4138397Z kernel = self.compile( 2025-05-07T20:32:12.4138791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4138974Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4139102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4139109Z 2025-05-07T20:32:12.4139317Z self = 2025-05-07T20:32:12.4140179Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4140683Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc8a6b00>} 2025-05-07T20:32:12.4141450Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4141643Z context = 2025-05-07T20:32:12.4141697Z 2025-05-07T20:32:12.4141869Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4142135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4142244Z module_map=module_map) 2025-05-07T20:32:12.4142431Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4142533Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4142616Z E ^ 2025-05-07T20:32:12.4148654Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4148742Z 2025-05-07T20:32:12.4149283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4149289Z 2025-05-07T20:32:12.4149413Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4149644Z self=, 2025-05-07T20:32:12.4149731Z T=128, 2025-05-07T20:32:12.4149825Z D=5120, 2025-05-07T20:32:12.4149912Z scale_ub=None, 2025-05-07T20:32:12.4150005Z contiguous=False, 2025-05-07T20:32:12.4150101Z compiled=True, 2025-05-07T20:32:12.4150182Z ) 2025-05-07T20:32:12.4150404Z self = 2025-05-07T20:32:12.4150587Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4150592Z 2025-05-07T20:32:12.4150676Z @given( 2025-05-07T20:32:12.4150806Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4150914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4151040Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4151169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4151289Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4151368Z ) 2025-05-07T20:32:12.4151628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4151731Z def test_silu_mul_quant( 2025-05-07T20:32:12.4151812Z self, 2025-05-07T20:32:12.4151899Z T: int, 2025-05-07T20:32:12.4151979Z D: int, 2025-05-07T20:32:12.4152084Z scale_ub: Optional[float], 2025-05-07T20:32:12.4152187Z contiguous: bool, 2025-05-07T20:32:12.4152277Z compiled: bool, 2025-05-07T20:32:12.4152368Z ) -> None: 2025-05-07T20:32:12.4152468Z torch.manual_seed(2025) 2025-05-07T20:32:12.4152547Z 2025-05-07T20:32:12.4152733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4152814Z 2025-05-07T20:32:12.4152913Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4153052Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4153150Z x = x_sign * x_clamp 2025-05-07T20:32:12.4153238Z x0 = x[:, :D] 2025-05-07T20:32:12.4153334Z x1 = x[:, D:] 2025-05-07T20:32:12.4153414Z 2025-05-07T20:32:12.4153507Z if contiguous: 2025-05-07T20:32:12.4153613Z x0 = x0.contiguous() 2025-05-07T20:32:12.4153706Z x1 = x1.contiguous() 2025-05-07T20:32:12.4153797Z 2025-05-07T20:32:12.4153895Z if scale_ub is not None: 2025-05-07T20:32:12.4154006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4154157Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4154238Z ) 2025-05-07T20:32:12.4154321Z else: 2025-05-07T20:32:12.4154432Z scale_ub_tensor = None 2025-05-07T20:32:12.4154514Z 2025-05-07T20:32:12.4154650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4154760Z op = silu_mul_quant 2025-05-07T20:32:12.4154852Z if compiled: 2025-05-07T20:32:12.4154958Z op = torch.compile(op) 2025-05-07T20:32:12.4155080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4155162Z 2025-05-07T20:32:12.4155341Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4155353Z 2025-05-07T20:32:12.4155459Z moe/activation_test.py:117: 2025-05-07T20:32:12.4155593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4155711Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4155816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4156192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4156300Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4156984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4157098Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4157462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4157698Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4158055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4158156Z kernel = self.compile( 2025-05-07T20:32:12.4158545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4158736Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4158869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4158876Z 2025-05-07T20:32:12.4159094Z self = 2025-05-07T20:32:12.4159886Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4160402Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc56b370>} 2025-05-07T20:32:12.4161157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4161352Z context = 2025-05-07T20:32:12.4161356Z 2025-05-07T20:32:12.4161539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4161812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4161925Z module_map=module_map) 2025-05-07T20:32:12.4162103Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4162209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4162301Z E ^ 2025-05-07T20:32:12.4162660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4162665Z 2025-05-07T20:32:12.4163091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4163095Z 2025-05-07T20:32:12.4163215Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4163443Z self=, 2025-05-07T20:32:12.4163536Z T=128, 2025-05-07T20:32:12.4163617Z D=7168, 2025-05-07T20:32:12.4163705Z scale_ub=1200.0, 2025-05-07T20:32:12.4163806Z contiguous=False, 2025-05-07T20:32:12.4163904Z compiled=False, 2025-05-07T20:32:12.4163986Z ) 2025-05-07T20:32:12.4164215Z self = 2025-05-07T20:32:12.4164393Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4164446Z 2025-05-07T20:32:12.4164529Z @given( 2025-05-07T20:32:12.4164661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4164767Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4164896Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4165018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4165137Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4165226Z ) 2025-05-07T20:32:12.4165476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4166296Z def test_silu_mul_quant( 2025-05-07T20:32:12.4166383Z self, 2025-05-07T20:32:12.4166542Z T: int, 2025-05-07T20:32:12.4166624Z D: int, 2025-05-07T20:32:12.4166740Z scale_ub: Optional[float], 2025-05-07T20:32:12.4166837Z contiguous: bool, 2025-05-07T20:32:12.4166928Z compiled: bool, 2025-05-07T20:32:12.4167021Z ) -> None: 2025-05-07T20:32:12.4167125Z torch.manual_seed(2025) 2025-05-07T20:32:12.4167208Z 2025-05-07T20:32:12.4167385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4167465Z 2025-05-07T20:32:12.4167568Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4167698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4167793Z x = x_sign * x_clamp 2025-05-07T20:32:12.4167885Z x0 = x[:, :D] 2025-05-07T20:32:12.4167972Z x1 = x[:, D:] 2025-05-07T20:32:12.4168049Z 2025-05-07T20:32:12.4168149Z if contiguous: 2025-05-07T20:32:12.4168246Z x0 = x0.contiguous() 2025-05-07T20:32:12.4168344Z x1 = x1.contiguous() 2025-05-07T20:32:12.4168428Z 2025-05-07T20:32:12.4168524Z if scale_ub is not None: 2025-05-07T20:32:12.4168636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4168783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4168867Z ) 2025-05-07T20:32:12.4168960Z else: 2025-05-07T20:32:12.4169058Z scale_ub_tensor = None 2025-05-07T20:32:12.4169138Z 2025-05-07T20:32:12.4169281Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4169380Z op = silu_mul_quant 2025-05-07T20:32:12.4169472Z if compiled: 2025-05-07T20:32:12.4169588Z op = torch.compile(op) 2025-05-07T20:32:12.4169696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4169777Z 2025-05-07T20:32:12.4169882Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4169890Z 2025-05-07T20:32:12.4169992Z moe/activation_test.py:117: 2025-05-07T20:32:12.4170133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4170247Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4170351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4170858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4170964Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4171330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4171567Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4171916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4172015Z kernel = self.compile( 2025-05-07T20:32:12.4172412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4172595Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4172732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4172737Z 2025-05-07T20:32:12.4172947Z self = 2025-05-07T20:32:12.4173794Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4174315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc56a560>} 2025-05-07T20:32:12.4175140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4175381Z context = 2025-05-07T20:32:12.4175385Z 2025-05-07T20:32:12.4175553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4175833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4175944Z module_map=module_map) 2025-05-07T20:32:12.4176111Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4176222Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4176303Z E ^ 2025-05-07T20:32:12.4176661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4176666Z 2025-05-07T20:32:12.4177090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4177097Z 2025-05-07T20:32:12.4177208Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4177440Z self=, 2025-05-07T20:32:12.4177521Z T=128, 2025-05-07T20:32:12.4177602Z D=5120, 2025-05-07T20:32:12.4177697Z scale_ub=None, 2025-05-07T20:32:12.4177792Z contiguous=False, 2025-05-07T20:32:12.4177883Z compiled=False, 2025-05-07T20:32:12.4177967Z ) 2025-05-07T20:32:12.4178186Z self = 2025-05-07T20:32:12.4178360Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.4178364Z 2025-05-07T20:32:12.4178454Z @given( 2025-05-07T20:32:12.4178578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4178687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4178809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4178930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4179060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4179139Z ) 2025-05-07T20:32:12.4179386Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4179496Z def test_silu_mul_quant( 2025-05-07T20:32:12.4179580Z self, 2025-05-07T20:32:12.4179663Z T: int, 2025-05-07T20:32:12.4179749Z D: int, 2025-05-07T20:32:12.4179991Z scale_ub: Optional[float], 2025-05-07T20:32:12.4180093Z contiguous: bool, 2025-05-07T20:32:12.4180184Z compiled: bool, 2025-05-07T20:32:12.4180266Z ) -> None: 2025-05-07T20:32:12.4180370Z torch.manual_seed(2025) 2025-05-07T20:32:12.4180446Z 2025-05-07T20:32:12.4180620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4180704Z 2025-05-07T20:32:12.4180804Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4180932Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4181041Z x = x_sign * x_clamp 2025-05-07T20:32:12.4181128Z x0 = x[:, :D] 2025-05-07T20:32:12.4181213Z x1 = x[:, D:] 2025-05-07T20:32:12.4181296Z 2025-05-07T20:32:12.4181385Z if contiguous: 2025-05-07T20:32:12.4181481Z x0 = x0.contiguous() 2025-05-07T20:32:12.4181637Z x1 = x1.contiguous() 2025-05-07T20:32:12.4181716Z 2025-05-07T20:32:12.4181817Z if scale_ub is not None: 2025-05-07T20:32:12.4181927Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4182069Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4182156Z ) 2025-05-07T20:32:12.4182236Z else: 2025-05-07T20:32:12.4182334Z scale_ub_tensor = None 2025-05-07T20:32:12.4182414Z 2025-05-07T20:32:12.4182549Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4182687Z op = silu_mul_quant 2025-05-07T20:32:12.4182783Z if compiled: 2025-05-07T20:32:12.4182985Z op = torch.compile(op) 2025-05-07T20:32:12.4183099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4183181Z 2025-05-07T20:32:12.4183276Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4183280Z 2025-05-07T20:32:12.4183388Z moe/activation_test.py:117: 2025-05-07T20:32:12.4183522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4183627Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4183739Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4184250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4184352Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4184720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4184951Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4185304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4185403Z kernel = self.compile( 2025-05-07T20:32:12.4185788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4185978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4186108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4186113Z 2025-05-07T20:32:12.4186330Z self = 2025-05-07T20:32:12.4187107Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4187624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc568550>} 2025-05-07T20:32:12.4188379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4188578Z context = 2025-05-07T20:32:12.4188582Z 2025-05-07T20:32:12.4188756Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4189020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4189132Z module_map=module_map) 2025-05-07T20:32:12.4189306Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4189414Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4189503Z E ^ 2025-05-07T20:32:12.4190166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4190174Z 2025-05-07T20:32:12.4190657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4190828Z 2025-05-07T20:32:12.4190946Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4191169Z self=, 2025-05-07T20:32:12.4191247Z T=128, 2025-05-07T20:32:12.4191336Z D=5120, 2025-05-07T20:32:12.4191426Z scale_ub=1200.0, 2025-05-07T20:32:12.4191517Z contiguous=True, 2025-05-07T20:32:12.4191605Z compiled=False, 2025-05-07T20:32:12.4191681Z ) 2025-05-07T20:32:12.4191904Z self = 2025-05-07T20:32:12.4192155Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.4192159Z 2025-05-07T20:32:12.4192364Z @given( 2025-05-07T20:32:12.4192497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4192599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4192718Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4192848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4192964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4193043Z ) 2025-05-07T20:32:12.4193294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4193391Z def test_silu_mul_quant( 2025-05-07T20:32:12.4193480Z self, 2025-05-07T20:32:12.4193561Z T: int, 2025-05-07T20:32:12.4193638Z D: int, 2025-05-07T20:32:12.4193747Z scale_ub: Optional[float], 2025-05-07T20:32:12.4193841Z contiguous: bool, 2025-05-07T20:32:12.4193936Z compiled: bool, 2025-05-07T20:32:12.4194023Z ) -> None: 2025-05-07T20:32:12.4194123Z torch.manual_seed(2025) 2025-05-07T20:32:12.4194195Z 2025-05-07T20:32:12.4194368Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4194443Z 2025-05-07T20:32:12.4194545Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4194669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4194764Z x = x_sign * x_clamp 2025-05-07T20:32:12.4194853Z x0 = x[:, :D] 2025-05-07T20:32:12.4194936Z x1 = x[:, D:] 2025-05-07T20:32:12.4195010Z 2025-05-07T20:32:12.4195099Z if contiguous: 2025-05-07T20:32:12.4195191Z x0 = x0.contiguous() 2025-05-07T20:32:12.4195282Z x1 = x1.contiguous() 2025-05-07T20:32:12.4195362Z 2025-05-07T20:32:12.4195452Z if scale_ub is not None: 2025-05-07T20:32:12.4195561Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4195705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4195780Z ) 2025-05-07T20:32:12.4195865Z else: 2025-05-07T20:32:12.4195968Z scale_ub_tensor = None 2025-05-07T20:32:12.4196041Z 2025-05-07T20:32:12.4196179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4196272Z op = silu_mul_quant 2025-05-07T20:32:12.4196364Z if compiled: 2025-05-07T20:32:12.4196472Z op = torch.compile(op) 2025-05-07T20:32:12.4196580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4196653Z 2025-05-07T20:32:12.4196752Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4196757Z 2025-05-07T20:32:12.4196855Z moe/activation_test.py:117: 2025-05-07T20:32:12.4196985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4197093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4197193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4197702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4197803Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4198168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4198399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4198793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4198895Z kernel = self.compile( 2025-05-07T20:32:12.4199283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4199457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4199592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4199637Z 2025-05-07T20:32:12.4199846Z self = 2025-05-07T20:32:12.4200716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4201233Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bc568700>} 2025-05-07T20:32:12.4201974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4202173Z context = 2025-05-07T20:32:12.4202177Z 2025-05-07T20:32:12.4202348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4202627Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4202738Z module_map=module_map) 2025-05-07T20:32:12.4202905Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4203013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4203093Z E ^ 2025-05-07T20:32:12.4203447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4203451Z 2025-05-07T20:32:12.4203869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4203874Z 2025-05-07T20:32:12.4203979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4204209Z self=, 2025-05-07T20:32:12.4204286Z T=1, 2025-05-07T20:32:12.4204363Z D=7168, 2025-05-07T20:32:12.4204453Z scale_ub=1200.0, 2025-05-07T20:32:12.4204548Z contiguous=True, 2025-05-07T20:32:12.4204633Z compiled=True, 2025-05-07T20:32:12.4204717Z ) 2025-05-07T20:32:12.4204934Z self = 2025-05-07T20:32:12.4205103Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.4205118Z 2025-05-07T20:32:12.4205195Z @given( 2025-05-07T20:32:12.4205319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4205423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4205544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4205663Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4205787Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4205865Z ) 2025-05-07T20:32:12.4206110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4206218Z def test_silu_mul_quant( 2025-05-07T20:32:12.4206295Z self, 2025-05-07T20:32:12.4206383Z T: int, 2025-05-07T20:32:12.4206466Z D: int, 2025-05-07T20:32:12.4206568Z scale_ub: Optional[float], 2025-05-07T20:32:12.4206666Z contiguous: bool, 2025-05-07T20:32:12.4206756Z compiled: bool, 2025-05-07T20:32:12.4206834Z ) -> None: 2025-05-07T20:32:12.4207007Z torch.manual_seed(2025) 2025-05-07T20:32:12.4207079Z 2025-05-07T20:32:12.4207250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4207327Z 2025-05-07T20:32:12.4207421Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4207548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4207650Z x = x_sign * x_clamp 2025-05-07T20:32:12.4207731Z x0 = x[:, :D] 2025-05-07T20:32:12.4207812Z x1 = x[:, D:] 2025-05-07T20:32:12.4207897Z 2025-05-07T20:32:12.4208031Z if contiguous: 2025-05-07T20:32:12.4208133Z x0 = x0.contiguous() 2025-05-07T20:32:12.4208301Z x1 = x1.contiguous() 2025-05-07T20:32:12.4208376Z 2025-05-07T20:32:12.4208484Z if scale_ub is not None: 2025-05-07T20:32:12.4208591Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4208729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4208819Z ) 2025-05-07T20:32:12.4208896Z else: 2025-05-07T20:32:12.4209007Z scale_ub_tensor = None 2025-05-07T20:32:12.4209079Z 2025-05-07T20:32:12.4209215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4209314Z op = silu_mul_quant 2025-05-07T20:32:12.4209402Z if compiled: 2025-05-07T20:32:12.4209504Z op = torch.compile(op) 2025-05-07T20:32:12.4209619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4209693Z 2025-05-07T20:32:12.4209789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4209796Z 2025-05-07T20:32:12.4209908Z moe/activation_test.py:117: 2025-05-07T20:32:12.4210044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4210155Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4210261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4210634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4210742Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4211235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4211334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4211698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4211921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4212271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4212375Z kernel = self.compile( 2025-05-07T20:32:12.4212764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4212952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4213081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4213086Z 2025-05-07T20:32:12.4213301Z self = 2025-05-07T20:32:12.4214069Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4214575Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07bcd18280>} 2025-05-07T20:32:12.4215333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4215524Z context = 2025-05-07T20:32:12.4215576Z 2025-05-07T20:32:12.4215749Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4216013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4216124Z module_map=module_map) 2025-05-07T20:32:12.4216299Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4216400Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4216482Z E ^ 2025-05-07T20:32:12.4216885Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4216988Z 2025-05-07T20:32:12.4217409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4217413Z 2025-05-07T20:32:12.4217525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4217749Z self=, 2025-05-07T20:32:12.4217828Z T=1, 2025-05-07T20:32:12.4217912Z D=7168, 2025-05-07T20:32:12.4217997Z scale_ub=1200.0, 2025-05-07T20:32:12.4218090Z contiguous=False, 2025-05-07T20:32:12.4218176Z compiled=True, 2025-05-07T20:32:12.4218250Z ) 2025-05-07T20:32:12.4218472Z self = 2025-05-07T20:32:12.4218644Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4218653Z 2025-05-07T20:32:12.4218734Z @given( 2025-05-07T20:32:12.4218865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4218972Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4219090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4219217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4219334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4219415Z ) 2025-05-07T20:32:12.4219660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4219756Z def test_silu_mul_quant( 2025-05-07T20:32:12.4219949Z self, 2025-05-07T20:32:12.4220026Z T: int, 2025-05-07T20:32:12.4220103Z D: int, 2025-05-07T20:32:12.4220212Z scale_ub: Optional[float], 2025-05-07T20:32:12.4220301Z contiguous: bool, 2025-05-07T20:32:12.4220390Z compiled: bool, 2025-05-07T20:32:12.4220472Z ) -> None: 2025-05-07T20:32:12.4220566Z torch.manual_seed(2025) 2025-05-07T20:32:12.4220642Z 2025-05-07T20:32:12.4220824Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4220899Z 2025-05-07T20:32:12.4221001Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4221129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4221221Z x = x_sign * x_clamp 2025-05-07T20:32:12.4221309Z x0 = x[:, :D] 2025-05-07T20:32:12.4221392Z x1 = x[:, D:] 2025-05-07T20:32:12.4221463Z 2025-05-07T20:32:12.4221551Z if contiguous: 2025-05-07T20:32:12.4221645Z x0 = x0.contiguous() 2025-05-07T20:32:12.4221735Z x1 = x1.contiguous() 2025-05-07T20:32:12.4221815Z 2025-05-07T20:32:12.4221907Z if scale_ub is not None: 2025-05-07T20:32:12.4222015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4222161Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4222238Z ) 2025-05-07T20:32:12.4222325Z else: 2025-05-07T20:32:12.4222423Z scale_ub_tensor = None 2025-05-07T20:32:12.4222498Z 2025-05-07T20:32:12.4222640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4222734Z op = silu_mul_quant 2025-05-07T20:32:12.4222822Z if compiled: 2025-05-07T20:32:12.4222931Z op = torch.compile(op) 2025-05-07T20:32:12.4223042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4223177Z 2025-05-07T20:32:12.4223280Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4223284Z 2025-05-07T20:32:12.4223383Z moe/activation_test.py:117: 2025-05-07T20:32:12.4223512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4223631Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4223736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4224111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4224259Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4224832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4224942Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4225306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4225542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4225892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4225995Z kernel = self.compile( 2025-05-07T20:32:12.4226388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4226565Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4226696Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4226701Z 2025-05-07T20:32:12.4226920Z self = 2025-05-07T20:32:12.4227689Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4228195Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf75b0>} 2025-05-07T20:32:12.4228937Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4229138Z context = 2025-05-07T20:32:12.4229145Z 2025-05-07T20:32:12.4229311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4229577Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4229694Z module_map=module_map) 2025-05-07T20:32:12.4229858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4229959Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4230048Z E ^ 2025-05-07T20:32:12.4230402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4230407Z 2025-05-07T20:32:12.4230826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4230830Z 2025-05-07T20:32:12.4230938Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4231166Z self=, 2025-05-07T20:32:12.4231254Z T=1, 2025-05-07T20:32:12.4231330Z D=7168, 2025-05-07T20:32:12.4231419Z scale_ub=None, 2025-05-07T20:32:12.4231517Z contiguous=False, 2025-05-07T20:32:12.4231607Z compiled=True, 2025-05-07T20:32:12.4231686Z ) 2025-05-07T20:32:12.4231903Z self = 2025-05-07T20:32:12.4232125Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4232130Z 2025-05-07T20:32:12.4232211Z @given( 2025-05-07T20:32:12.4232331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4232433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4232556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4232674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4232790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4232872Z ) 2025-05-07T20:32:12.4233167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4233266Z def test_silu_mul_quant( 2025-05-07T20:32:12.4233424Z self, 2025-05-07T20:32:12.4233503Z T: int, 2025-05-07T20:32:12.4233586Z D: int, 2025-05-07T20:32:12.4233686Z scale_ub: Optional[float], 2025-05-07T20:32:12.4233777Z contiguous: bool, 2025-05-07T20:32:12.4233874Z compiled: bool, 2025-05-07T20:32:12.4233954Z ) -> None: 2025-05-07T20:32:12.4234048Z torch.manual_seed(2025) 2025-05-07T20:32:12.4234127Z 2025-05-07T20:32:12.4234295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4234367Z 2025-05-07T20:32:12.4234467Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4234591Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4234689Z x = x_sign * x_clamp 2025-05-07T20:32:12.4234770Z x0 = x[:, :D] 2025-05-07T20:32:12.4234850Z x1 = x[:, D:] 2025-05-07T20:32:12.4234930Z 2025-05-07T20:32:12.4235016Z if contiguous: 2025-05-07T20:32:12.4235115Z x0 = x0.contiguous() 2025-05-07T20:32:12.4235214Z x1 = x1.contiguous() 2025-05-07T20:32:12.4235285Z 2025-05-07T20:32:12.4235376Z if scale_ub is not None: 2025-05-07T20:32:12.4235491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4235626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4235704Z ) 2025-05-07T20:32:12.4235790Z else: 2025-05-07T20:32:12.4235885Z scale_ub_tensor = None 2025-05-07T20:32:12.4235963Z 2025-05-07T20:32:12.4236096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4236185Z op = silu_mul_quant 2025-05-07T20:32:12.4236277Z if compiled: 2025-05-07T20:32:12.4236379Z op = torch.compile(op) 2025-05-07T20:32:12.4236488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4236569Z 2025-05-07T20:32:12.4236662Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.4236791Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.4236873Z 2025-05-07T20:32:12.4237013Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4237117Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.4237226Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.4237352Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.4237499Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4237575Z 2025-05-07T20:32:12.4237676Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:12.4237681Z 2025-05-07T20:32:12.4237787Z moe/activation_test.py:126: 2025-05-07T20:32:12.4237915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4238022Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.4238165Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.4238728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.4238840Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.4239199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4239472Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4239848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.4240103Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4240502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:12.4240758Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.4241260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.4241436Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.4241783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.4241862Z fn() 2025-05-07T20:32:12.4242268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.4242352Z self.fn.run( 2025-05-07T20:32:12.4242695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4242793Z kernel = self.compile( 2025-05-07T20:32:12.4243176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4243367Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4243500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4243504Z 2025-05-07T20:32:12.4243710Z self = 2025-05-07T20:32:12.4244487Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4244987Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f06abcf5bd0>} 2025-05-07T20:32:12.4245734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4245931Z context = 2025-05-07T20:32:12.4245936Z 2025-05-07T20:32:12.4246112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4246380Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4246487Z module_map=module_map) 2025-05-07T20:32:12.4246661Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4246763Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.4246841Z E ^ 2025-05-07T20:32:12.4247200Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4247205Z 2025-05-07T20:32:12.4247615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4247619Z 2025-05-07T20:32:12.4247732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4247953Z self=, 2025-05-07T20:32:12.4248032Z T=1, 2025-05-07T20:32:12.4248115Z D=5120, 2025-05-07T20:32:12.4248200Z scale_ub=1200.0, 2025-05-07T20:32:12.4248289Z contiguous=False, 2025-05-07T20:32:12.4248378Z compiled=True, 2025-05-07T20:32:12.4248451Z ) 2025-05-07T20:32:12.4248725Z self = 2025-05-07T20:32:12.4248896Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4248901Z 2025-05-07T20:32:12.4248981Z @given( 2025-05-07T20:32:12.4249107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4249212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4249330Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4249458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4249649Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4249722Z ) 2025-05-07T20:32:12.4250050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4250146Z def test_silu_mul_quant( 2025-05-07T20:32:12.4250231Z self, 2025-05-07T20:32:12.4250310Z T: int, 2025-05-07T20:32:12.4250389Z D: int, 2025-05-07T20:32:12.4250502Z scale_ub: Optional[float], 2025-05-07T20:32:12.4250595Z contiguous: bool, 2025-05-07T20:32:12.4250680Z compiled: bool, 2025-05-07T20:32:12.4250764Z ) -> None: 2025-05-07T20:32:12.4250861Z torch.manual_seed(2025) 2025-05-07T20:32:12.4250934Z 2025-05-07T20:32:12.4251111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4251184Z 2025-05-07T20:32:12.4251277Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4251410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4251504Z x = x_sign * x_clamp 2025-05-07T20:32:12.4251595Z x0 = x[:, :D] 2025-05-07T20:32:12.4251674Z x1 = x[:, D:] 2025-05-07T20:32:12.4251751Z 2025-05-07T20:32:12.4251842Z if contiguous: 2025-05-07T20:32:12.4251936Z x0 = x0.contiguous() 2025-05-07T20:32:12.4252028Z x1 = x1.contiguous() 2025-05-07T20:32:12.4252105Z 2025-05-07T20:32:12.4252197Z if scale_ub is not None: 2025-05-07T20:32:12.4252309Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4252451Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4252526Z ) 2025-05-07T20:32:12.4252603Z else: 2025-05-07T20:32:12.4252705Z scale_ub_tensor = None 2025-05-07T20:32:12.4252777Z 2025-05-07T20:32:12.4252909Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4253009Z op = silu_mul_quant 2025-05-07T20:32:12.4253094Z if compiled: 
2025-05-07T20:32:12.4253202Z op = torch.compile(op) 2025-05-07T20:32:12.4253314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4253386Z 2025-05-07T20:32:12.4253491Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4253495Z 2025-05-07T20:32:12.4253600Z moe/activation_test.py:117: 2025-05-07T20:32:12.4253731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4253843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4253947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4254344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4254462Z return fn(*args, **kwargs) 2025-05-07T20:32:12.4254964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4255072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4255434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4255662Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4256010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4256107Z kernel = self.compile( 2025-05-07T20:32:12.4256494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4256728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4256856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4256861Z 2025-05-07T20:32:12.4257073Z self = 2025-05-07T20:32:12.4257844Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4258475Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf43a0>} 2025-05-07T20:32:12.4259231Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4259428Z context = 2025-05-07T20:32:12.4259439Z 2025-05-07T20:32:12.4259607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4259989Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4260109Z module_map=module_map) 2025-05-07T20:32:12.4260278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4260379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4260470Z E ^ 2025-05-07T20:32:12.4260826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4260830Z 2025-05-07T20:32:12.4261246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4261253Z 2025-05-07T20:32:12.4261359Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4261583Z self=, 2025-05-07T20:32:12.4261667Z T=1, 2025-05-07T20:32:12.4261746Z D=5120, 2025-05-07T20:32:12.4261831Z scale_ub=1200.0, 2025-05-07T20:32:12.4261927Z contiguous=False, 2025-05-07T20:32:12.4262009Z compiled=False, 2025-05-07T20:32:12.4262083Z ) 2025-05-07T20:32:12.4262307Z self = 2025-05-07T20:32:12.4262480Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4262490Z 2025-05-07T20:32:12.4262570Z @given( 2025-05-07T20:32:12.4262690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4262791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4262918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4263046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4263162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4263241Z ) 2025-05-07T20:32:12.4263487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4263582Z def test_silu_mul_quant( 2025-05-07T20:32:12.4263665Z self, 2025-05-07T20:32:12.4263741Z T: int, 2025-05-07T20:32:12.4263823Z D: int, 2025-05-07T20:32:12.4263925Z scale_ub: Optional[float], 2025-05-07T20:32:12.4264018Z contiguous: bool, 2025-05-07T20:32:12.4264109Z compiled: bool, 2025-05-07T20:32:12.4264187Z ) -> None: 2025-05-07T20:32:12.4264287Z torch.manual_seed(2025) 2025-05-07T20:32:12.4264366Z 2025-05-07T20:32:12.4264535Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4264606Z 2025-05-07T20:32:12.4264705Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4264891Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4264982Z x = x_sign * x_clamp 2025-05-07T20:32:12.4265069Z x0 = x[:, :D] 2025-05-07T20:32:12.4265151Z x1 = x[:, D:] 2025-05-07T20:32:12.4265228Z 2025-05-07T20:32:12.4265312Z if contiguous: 2025-05-07T20:32:12.4265407Z x0 = x0.contiguous() 2025-05-07T20:32:12.4265505Z x1 = x1.contiguous() 2025-05-07T20:32:12.4265577Z 2025-05-07T20:32:12.4265670Z if scale_ub is not None: 2025-05-07T20:32:12.4265782Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4265964Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4266116Z ) 2025-05-07T20:32:12.4266199Z else: 2025-05-07T20:32:12.4266294Z scale_ub_tensor = None 2025-05-07T20:32:12.4266366Z 2025-05-07T20:32:12.4266506Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4266599Z op = silu_mul_quant 2025-05-07T20:32:12.4266687Z if compiled: 2025-05-07T20:32:12.4266795Z op = torch.compile(op) 2025-05-07T20:32:12.4266903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4266979Z 2025-05-07T20:32:12.4267073Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4267077Z 2025-05-07T20:32:12.4267178Z moe/activation_test.py:117: 2025-05-07T20:32:12.4267313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4267417Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4267521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4268028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4268129Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4268498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4268722Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4269060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4269163Z kernel = self.compile( 2025-05-07T20:32:12.4269544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4269719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4269851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4269858Z 2025-05-07T20:32:12.4270069Z self = 2025-05-07T20:32:12.4270861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4271360Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf4ee0>} 2025-05-07T20:32:12.4272118Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4272309Z context = 2025-05-07T20:32:12.4272316Z 2025-05-07T20:32:12.4272482Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4272770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4272878Z module_map=module_map) 2025-05-07T20:32:12.4273053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4278804Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4278907Z E ^ 2025-05-07T20:32:12.4279287Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4279293Z 2025-05-07T20:32:12.4279713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4279718Z 2025-05-07T20:32:12.4279826Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4280057Z self=, 2025-05-07T20:32:12.4280224Z T=16384, 2025-05-07T20:32:12.4280302Z D=5120, 2025-05-07T20:32:12.4280397Z scale_ub=1200.0, 2025-05-07T20:32:12.4280569Z contiguous=False, 2025-05-07T20:32:12.4280663Z compiled=True, 2025-05-07T20:32:12.4280741Z ) 2025-05-07T20:32:12.4280960Z self = 2025-05-07T20:32:12.4281147Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4281155Z 2025-05-07T20:32:12.4281236Z @given( 2025-05-07T20:32:12.4281361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4281472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4281589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4281708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4281832Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4281911Z ) 2025-05-07T20:32:12.4282168Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4282263Z def test_silu_mul_quant( 2025-05-07T20:32:12.4282343Z self, 2025-05-07T20:32:12.4282429Z T: int, 2025-05-07T20:32:12.4282505Z D: int, 2025-05-07T20:32:12.4282608Z scale_ub: Optional[float], 2025-05-07T20:32:12.4282709Z contiguous: bool, 2025-05-07T20:32:12.4282799Z compiled: bool, 2025-05-07T20:32:12.4282885Z ) -> None: 2025-05-07T20:32:12.4282989Z torch.manual_seed(2025) 2025-05-07T20:32:12.4283063Z 2025-05-07T20:32:12.4283235Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4283315Z 2025-05-07T20:32:12.4283410Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4283547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4283642Z x = x_sign * x_clamp 2025-05-07T20:32:12.4283726Z x0 = x[:, :D] 2025-05-07T20:32:12.4283817Z x1 = x[:, D:] 2025-05-07T20:32:12.4283896Z 2025-05-07T20:32:12.4283984Z if contiguous: 2025-05-07T20:32:12.4284088Z x0 = x0.contiguous() 2025-05-07T20:32:12.4284187Z x1 = x1.contiguous() 2025-05-07T20:32:12.4284262Z 2025-05-07T20:32:12.4284371Z if scale_ub is not None: 2025-05-07T20:32:12.4284482Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4284620Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4284705Z ) 2025-05-07T20:32:12.4284787Z else: 2025-05-07T20:32:12.4284896Z scale_ub_tensor = None 2025-05-07T20:32:12.4284969Z 2025-05-07T20:32:12.4285102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4285204Z op = silu_mul_quant 2025-05-07T20:32:12.4285292Z if compiled: 2025-05-07T20:32:12.4285398Z op = torch.compile(op) 2025-05-07T20:32:12.4285520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4285600Z 2025-05-07T20:32:12.4285694Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4285699Z 2025-05-07T20:32:12.4285809Z moe/activation_test.py:117: 2025-05-07T20:32:12.4285947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4286060Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4286164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4286539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4286741Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4287237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4287340Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4287710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4287936Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4288464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4288563Z kernel = self.compile( 2025-05-07T20:32:12.4288955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4289144Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4289275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4289280Z 2025-05-07T20:32:12.4289490Z self = 2025-05-07T20:32:12.4290722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4291242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf69e0>} 2025-05-07T20:32:12.4291996Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4292195Z context = 2025-05-07T20:32:12.4292200Z 2025-05-07T20:32:12.4292375Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4292640Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4292750Z module_map=module_map) 2025-05-07T20:32:12.4292924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4293029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4293109Z E ^ 2025-05-07T20:32:12.4293474Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4293479Z 2025-05-07T20:32:12.4293898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4293903Z 2025-05-07T20:32:12.4294019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4294246Z self=, 2025-05-07T20:32:12.4294325Z T=2048, 2025-05-07T20:32:12.4294411Z D=7168, 2025-05-07T20:32:12.4294496Z scale_ub=1200.0, 2025-05-07T20:32:12.4294584Z contiguous=False, 2025-05-07T20:32:12.4294677Z compiled=True, 2025-05-07T20:32:12.4294752Z ) 2025-05-07T20:32:12.4294978Z self = 2025-05-07T20:32:12.4295156Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4295163Z 2025-05-07T20:32:12.4295241Z @given( 2025-05-07T20:32:12.4295375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4295478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4295598Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4295727Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4296012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4296085Z ) 2025-05-07T20:32:12.4296338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4296435Z def test_silu_mul_quant( 2025-05-07T20:32:12.4296518Z self, 2025-05-07T20:32:12.4296596Z T: int, 2025-05-07T20:32:12.4296672Z D: int, 2025-05-07T20:32:12.4296785Z scale_ub: Optional[float], 2025-05-07T20:32:12.4296878Z contiguous: bool, 2025-05-07T20:32:12.4296966Z compiled: bool, 2025-05-07T20:32:12.4297208Z ) -> None: 2025-05-07T20:32:12.4297304Z torch.manual_seed(2025) 2025-05-07T20:32:12.4297377Z 2025-05-07T20:32:12.4297672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4297751Z 2025-05-07T20:32:12.4297846Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4297980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4298072Z x = x_sign * x_clamp 2025-05-07T20:32:12.4298170Z x0 = x[:, :D] 2025-05-07T20:32:12.4298252Z x1 = x[:, D:] 2025-05-07T20:32:12.4298324Z 2025-05-07T20:32:12.4298417Z if contiguous: 2025-05-07T20:32:12.4298512Z x0 = x0.contiguous() 2025-05-07T20:32:12.4298604Z x1 = x1.contiguous() 2025-05-07T20:32:12.4298690Z 2025-05-07T20:32:12.4298784Z if scale_ub is not None: 2025-05-07T20:32:12.4298892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4299039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4299119Z ) 2025-05-07T20:32:12.4299198Z else: 2025-05-07T20:32:12.4299306Z scale_ub_tensor = None 2025-05-07T20:32:12.4299384Z 2025-05-07T20:32:12.4299527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4299622Z op = silu_mul_quant 2025-05-07T20:32:12.4299709Z if compiled: 2025-05-07T20:32:12.4299904Z op = torch.compile(op) 2025-05-07T20:32:12.4300022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4300096Z 2025-05-07T20:32:12.4300196Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4300200Z 2025-05-07T20:32:12.4300298Z moe/activation_test.py:117: 2025-05-07T20:32:12.4300426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4300534Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4300634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4301001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4301107Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4301616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4301730Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4302094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4302323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4302672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4302767Z kernel = self.compile( 2025-05-07T20:32:12.4303156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4303345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4303475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4303484Z 2025-05-07T20:32:12.4303703Z self = 2025-05-07T20:32:12.4304483Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4305047Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abcf7b50>} 2025-05-07T20:32:12.4305811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4306005Z context = 2025-05-07T20:32:12.4306054Z 2025-05-07T20:32:12.4306303Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4306573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4306690Z module_map=module_map) 2025-05-07T20:32:12.4306855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4306958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4307048Z E ^ 2025-05-07T20:32:12.4307413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4307418Z 2025-05-07T20:32:12.4307831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4307836Z 2025-05-07T20:32:12.4307958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4308184Z self=, 2025-05-07T20:32:12.4308268Z T=1, 2025-05-07T20:32:12.4308356Z D=5120, 2025-05-07T20:32:12.4308442Z scale_ub=None, 2025-05-07T20:32:12.4308542Z contiguous=False, 2025-05-07T20:32:12.4308629Z compiled=False, 2025-05-07T20:32:12.4308701Z ) 2025-05-07T20:32:12.4308924Z self = 2025-05-07T20:32:12.4309097Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.4309101Z 2025-05-07T20:32:12.4309182Z @given( 2025-05-07T20:32:12.4309311Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4309412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4309535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4309658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4309776Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4309856Z ) 2025-05-07T20:32:12.4310107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4310207Z def test_silu_mul_quant( 2025-05-07T20:32:12.4310289Z self, 2025-05-07T20:32:12.4310367Z T: int, 2025-05-07T20:32:12.4310446Z D: int, 2025-05-07T20:32:12.4310556Z scale_ub: Optional[float], 2025-05-07T20:32:12.4310651Z contiguous: bool, 2025-05-07T20:32:12.4310737Z compiled: bool, 2025-05-07T20:32:12.4310823Z ) -> None: 2025-05-07T20:32:12.4310923Z torch.manual_seed(2025) 2025-05-07T20:32:12.4311002Z 2025-05-07T20:32:12.4311170Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4311245Z 2025-05-07T20:32:12.4311348Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4311474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4311566Z x = x_sign * x_clamp 2025-05-07T20:32:12.4311654Z x0 = x[:, :D] 2025-05-07T20:32:12.4311739Z x1 = x[:, D:] 2025-05-07T20:32:12.4311812Z 2025-05-07T20:32:12.4311902Z if contiguous: 2025-05-07T20:32:12.4312002Z x0 = x0.contiguous() 2025-05-07T20:32:12.4312092Z x1 = x1.contiguous() 2025-05-07T20:32:12.4312177Z 2025-05-07T20:32:12.4312270Z if scale_ub is not None: 2025-05-07T20:32:12.4312384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4312573Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4312652Z ) 2025-05-07T20:32:12.4312735Z else: 2025-05-07T20:32:12.4312832Z scale_ub_tensor = None 2025-05-07T20:32:12.4312903Z 2025-05-07T20:32:12.4313039Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4313131Z op = silu_mul_quant 2025-05-07T20:32:12.4313218Z if compiled: 2025-05-07T20:32:12.4313326Z op = torch.compile(op) 2025-05-07T20:32:12.4313435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4313554Z 2025-05-07T20:32:12.4313655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4313771Z 2025-05-07T20:32:12.4313873Z moe/activation_test.py:117: 2025-05-07T20:32:12.4314011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4314116Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4314217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4314733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4314835Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4315192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4315424Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4315766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4315871Z kernel = self.compile( 2025-05-07T20:32:12.4316260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4316437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4316572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4316579Z 2025-05-07T20:32:12.4316787Z self = 2025-05-07T20:32:12.4317584Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4318084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe945e0>} 2025-05-07T20:32:12.4318847Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4319048Z context = 2025-05-07T20:32:12.4319053Z 2025-05-07T20:32:12.4319227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4319496Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4319605Z module_map=module_map) 2025-05-07T20:32:12.4319770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4319878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4319957Z E ^ 2025-05-07T20:32:12.4320327Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4320335Z 2025-05-07T20:32:12.4320754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4320758Z 2025-05-07T20:32:12.4320865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4321094Z self=, 2025-05-07T20:32:12.4321220Z T=4096, 2025-05-07T20:32:12.4321300Z D=7168, 2025-05-07T20:32:12.4321392Z scale_ub=1200.0, 2025-05-07T20:32:12.4321478Z contiguous=False, 2025-05-07T20:32:12.4321572Z compiled=False, 2025-05-07T20:32:12.4321644Z ) 2025-05-07T20:32:12.4321862Z self = 2025-05-07T20:32:12.4322045Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4322050Z 2025-05-07T20:32:12.4322131Z @given( 2025-05-07T20:32:12.4322252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4322405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4322603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4322723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4322846Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4322925Z ) 2025-05-07T20:32:12.4323181Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4323280Z def test_silu_mul_quant( 2025-05-07T20:32:12.4323357Z self, 2025-05-07T20:32:12.4323443Z T: int, 2025-05-07T20:32:12.4323520Z D: int, 2025-05-07T20:32:12.4323626Z scale_ub: Optional[float], 2025-05-07T20:32:12.4323723Z contiguous: bool, 2025-05-07T20:32:12.4323809Z compiled: bool, 2025-05-07T20:32:12.4323893Z ) -> None: 2025-05-07T20:32:12.4323997Z torch.manual_seed(2025) 2025-05-07T20:32:12.4324086Z 2025-05-07T20:32:12.4324286Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4324377Z 2025-05-07T20:32:12.4324477Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4324609Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4324701Z x = x_sign * x_clamp 2025-05-07T20:32:12.4324784Z x0 = x[:, :D] 2025-05-07T20:32:12.4324873Z x1 = x[:, D:] 2025-05-07T20:32:12.4324950Z 2025-05-07T20:32:12.4325034Z if contiguous: 2025-05-07T20:32:12.4325134Z x0 = x0.contiguous() 2025-05-07T20:32:12.4325225Z x1 = x1.contiguous() 2025-05-07T20:32:12.4325298Z 2025-05-07T20:32:12.4325397Z if scale_ub is not None: 2025-05-07T20:32:12.4325504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4325638Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4325725Z ) 2025-05-07T20:32:12.4325800Z else: 2025-05-07T20:32:12.4325899Z scale_ub_tensor = None 2025-05-07T20:32:12.4325983Z 2025-05-07T20:32:12.4326117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4326218Z op = silu_mul_quant 2025-05-07T20:32:12.4326307Z if compiled: 2025-05-07T20:32:12.4326410Z op = torch.compile(op) 2025-05-07T20:32:12.4326526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4326601Z 2025-05-07T20:32:12.4326696Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4326701Z 2025-05-07T20:32:12.4326805Z moe/activation_test.py:117: 2025-05-07T20:32:12.4326935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4327038Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4327147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4327652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:12.4327759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4328125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4328347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4328699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4328843Z kernel = self.compile( 2025-05-07T20:32:12.4329238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4329414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4329540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4329544Z 2025-05-07T20:32:12.4329756Z self = 2025-05-07T20:32:12.4330620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4331173Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe94ca0>} 2025-05-07T20:32:12.4331917Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4332108Z context = 2025-05-07T20:32:12.4332113Z 2025-05-07T20:32:12.4332288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4332551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4332668Z module_map=module_map) 2025-05-07T20:32:12.4332835Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4332943Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4333029Z E ^ 2025-05-07T20:32:12.4333389Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4333397Z 2025-05-07T20:32:12.4333809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4333822Z 2025-05-07T20:32:12.4333929Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4334150Z self=, 2025-05-07T20:32:12.4334234Z T=16384, 2025-05-07T20:32:12.4334312Z D=7168, 2025-05-07T20:32:12.4334396Z scale_ub=None, 2025-05-07T20:32:12.4334490Z contiguous=True, 2025-05-07T20:32:12.4334573Z compiled=True, 2025-05-07T20:32:12.4334651Z ) 2025-05-07T20:32:12.4334876Z self = 2025-05-07T20:32:12.4335058Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.4335062Z 2025-05-07T20:32:12.4335145Z @given( 2025-05-07T20:32:12.4335267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4335369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4335495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4335620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4335736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4335815Z ) 2025-05-07T20:32:12.4336067Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4336163Z def test_silu_mul_quant( 2025-05-07T20:32:12.4336249Z self, 2025-05-07T20:32:12.4336325Z T: int, 2025-05-07T20:32:12.4336402Z D: int, 2025-05-07T20:32:12.4336511Z scale_ub: Optional[float], 2025-05-07T20:32:12.4336605Z contiguous: bool, 2025-05-07T20:32:12.4336705Z compiled: bool, 2025-05-07T20:32:12.4336784Z ) -> None: 2025-05-07T20:32:12.4336879Z torch.manual_seed(2025) 2025-05-07T20:32:12.4336960Z 2025-05-07T20:32:12.4337127Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4337252Z 2025-05-07T20:32:12.4337354Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4337482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4337576Z x = x_sign * x_clamp 2025-05-07T20:32:12.4337662Z x0 = x[:, :D] 2025-05-07T20:32:12.4337746Z x1 = x[:, D:] 2025-05-07T20:32:12.4337819Z 2025-05-07T20:32:12.4337909Z if contiguous: 2025-05-07T20:32:12.4338002Z x0 = x0.contiguous() 2025-05-07T20:32:12.4338103Z x1 = x1.contiguous() 2025-05-07T20:32:12.4338177Z 2025-05-07T20:32:12.4338315Z if scale_ub is not None: 2025-05-07T20:32:12.4338449Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4338669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4338755Z ) 2025-05-07T20:32:12.4338834Z else: 2025-05-07T20:32:12.4338932Z scale_ub_tensor = None 2025-05-07T20:32:12.4339011Z 2025-05-07T20:32:12.4339146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4339239Z op = silu_mul_quant 2025-05-07T20:32:12.4339334Z if compiled: 2025-05-07T20:32:12.4339440Z op = torch.compile(op) 2025-05-07T20:32:12.4339550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4339633Z 2025-05-07T20:32:12.4339729Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4339733Z 2025-05-07T20:32:12.4339972Z moe/activation_test.py:117: 2025-05-07T20:32:12.4340109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4340214Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4340320Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4340694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4340790Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4341290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4341397Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4341756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4341990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4342333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4342441Z kernel = self.compile( 2025-05-07T20:32:12.4342833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4343014Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4343149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4343153Z 2025-05-07T20:32:12.4343361Z self = 2025-05-07T20:32:12.4344144Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4344653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe95b40>} 2025-05-07T20:32:12.4345400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4345603Z context = 2025-05-07T20:32:12.4345608Z 2025-05-07T20:32:12.4345774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4346047Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4346270Z module_map=module_map) 2025-05-07T20:32:12.4346436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4346551Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4346628Z E ^ 2025-05-07T20:32:12.4346985Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4346990Z 2025-05-07T20:32:12.4347402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4347447Z 2025-05-07T20:32:12.4347660Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4347888Z self=, 2025-05-07T20:32:12.4347968Z T=4096, 2025-05-07T20:32:12.4348044Z D=5120, 2025-05-07T20:32:12.4348133Z scale_ub=None, 2025-05-07T20:32:12.4348227Z contiguous=False, 2025-05-07T20:32:12.4348318Z compiled=True, 2025-05-07T20:32:12.4348391Z ) 2025-05-07T20:32:12.4348607Z self = 2025-05-07T20:32:12.4348787Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4348791Z 2025-05-07T20:32:12.4348866Z @given( 2025-05-07T20:32:12.4348986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4349095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4349215Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4349335Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4349462Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4349539Z ) 2025-05-07T20:32:12.4349789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4349885Z def test_silu_mul_quant( 2025-05-07T20:32:12.4349963Z self, 2025-05-07T20:32:12.4350049Z T: int, 2025-05-07T20:32:12.4350125Z D: int, 2025-05-07T20:32:12.4350225Z scale_ub: Optional[float], 2025-05-07T20:32:12.4350325Z contiguous: bool, 2025-05-07T20:32:12.4350411Z compiled: bool, 2025-05-07T20:32:12.4350491Z ) -> None: 2025-05-07T20:32:12.4350595Z torch.manual_seed(2025) 2025-05-07T20:32:12.4350668Z 2025-05-07T20:32:12.4350838Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4350920Z 2025-05-07T20:32:12.4351016Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4351148Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4351243Z x = x_sign * x_clamp 2025-05-07T20:32:12.4351326Z x0 = x[:, :D] 2025-05-07T20:32:12.4351414Z x1 = x[:, D:] 2025-05-07T20:32:12.4351486Z 2025-05-07T20:32:12.4351571Z if contiguous: 2025-05-07T20:32:12.4351670Z x0 = x0.contiguous() 2025-05-07T20:32:12.4351766Z x1 = x1.contiguous() 2025-05-07T20:32:12.4351838Z 2025-05-07T20:32:12.4351938Z if scale_ub is not None: 2025-05-07T20:32:12.4352044Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4352181Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4352267Z ) 2025-05-07T20:32:12.4352343Z else: 2025-05-07T20:32:12.4352446Z scale_ub_tensor = None 2025-05-07T20:32:12.4352518Z 2025-05-07T20:32:12.4352647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4352747Z op = silu_mul_quant 2025-05-07T20:32:12.4352836Z if compiled: 2025-05-07T20:32:12.4352943Z op = torch.compile(op) 2025-05-07T20:32:12.4353057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4353132Z 2025-05-07T20:32:12.4353226Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4353230Z 2025-05-07T20:32:12.4353338Z moe/activation_test.py:117: 2025-05-07T20:32:12.4353526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4353632Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4353740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4354133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4354246Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4354751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4354895Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4355338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4355568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4355921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4356021Z kernel = self.compile( 2025-05-07T20:32:12.4356408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4356593Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4356720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4356725Z 2025-05-07T20:32:12.4356931Z self = 2025-05-07T20:32:12.4357716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4358223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe95240>} 2025-05-07T20:32:12.4358975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4359166Z context = 2025-05-07T20:32:12.4359171Z 2025-05-07T20:32:12.4359343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4359607Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4359720Z module_map=module_map) 2025-05-07T20:32:12.4359896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4360001Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4360078Z E ^ 2025-05-07T20:32:12.4360439Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4360446Z 2025-05-07T20:32:12.4360858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4360862Z 2025-05-07T20:32:12.4360976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4361199Z self=, 2025-05-07T20:32:12.4361278Z T=4096, 2025-05-07T20:32:12.4361361Z D=5120, 2025-05-07T20:32:12.4361447Z scale_ub=1200.0, 2025-05-07T20:32:12.4361534Z contiguous=False, 2025-05-07T20:32:12.4361630Z compiled=False, 2025-05-07T20:32:12.4361707Z ) 2025-05-07T20:32:12.4361934Z self = 2025-05-07T20:32:12.4362114Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4362118Z 2025-05-07T20:32:12.4362198Z @given( 2025-05-07T20:32:12.4362323Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4362473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4362592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4362717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4362833Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4362905Z ) 2025-05-07T20:32:12.4363156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4363251Z def test_silu_mul_quant( 2025-05-07T20:32:12.4363331Z self, 2025-05-07T20:32:12.4363448Z T: int, 2025-05-07T20:32:12.4363522Z D: int, 2025-05-07T20:32:12.4363630Z scale_ub: Optional[float], 2025-05-07T20:32:12.4363797Z contiguous: bool, 2025-05-07T20:32:12.4363886Z compiled: bool, 2025-05-07T20:32:12.4363972Z ) -> None: 2025-05-07T20:32:12.4364067Z torch.manual_seed(2025) 2025-05-07T20:32:12.4364139Z 2025-05-07T20:32:12.4364317Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4364394Z 2025-05-07T20:32:12.4364488Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4364619Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4364710Z x = x_sign * x_clamp 2025-05-07T20:32:12.4364796Z x0 = x[:, :D] 2025-05-07T20:32:12.4364878Z x1 = x[:, D:] 2025-05-07T20:32:12.4364953Z 2025-05-07T20:32:12.4365044Z if contiguous: 2025-05-07T20:32:12.4365138Z x0 = x0.contiguous() 2025-05-07T20:32:12.4365230Z x1 = x1.contiguous() 2025-05-07T20:32:12.4365311Z 2025-05-07T20:32:12.4365405Z if scale_ub is not None: 2025-05-07T20:32:12.4365516Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4365657Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4365734Z ) 2025-05-07T20:32:12.4365812Z else: 2025-05-07T20:32:12.4365916Z scale_ub_tensor = None 2025-05-07T20:32:12.4365992Z 2025-05-07T20:32:12.4366125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4366224Z op = silu_mul_quant 2025-05-07T20:32:12.4366311Z if compiled: 2025-05-07T20:32:12.4366420Z op = torch.compile(op) 2025-05-07T20:32:12.4366529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4366600Z 2025-05-07T20:32:12.4366698Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4366702Z 2025-05-07T20:32:12.4366806Z moe/activation_test.py:117: 2025-05-07T20:32:12.4366934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4367047Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4367156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4367665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:12.4367766Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4368132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4368359Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4368703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4368801Z kernel = self.compile( 2025-05-07T20:32:12.4369190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4369368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4369505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4369509Z 2025-05-07T20:32:12.4369717Z self = 2025-05-07T20:32:12.4370488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4371067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe96cb0>} 2025-05-07T20:32:12.4371809Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4372049Z context = 2025-05-07T20:32:12.4372132Z 2025-05-07T20:32:12.4372306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4372572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4372682Z module_map=module_map) 2025-05-07T20:32:12.4372847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4372956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4373034Z E ^ 2025-05-07T20:32:12.4373386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4373391Z 2025-05-07T20:32:12.4373808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4373819Z 2025-05-07T20:32:12.4373927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4374159Z self=, 2025-05-07T20:32:12.4374240Z T=4096, 2025-05-07T20:32:12.4374317Z D=5120, 2025-05-07T20:32:12.4374408Z scale_ub=1200.0, 2025-05-07T20:32:12.4374495Z contiguous=False, 2025-05-07T20:32:12.4374580Z compiled=True, 2025-05-07T20:32:12.4374661Z ) 2025-05-07T20:32:12.4374877Z self = 2025-05-07T20:32:12.4375052Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4375056Z 2025-05-07T20:32:12.4375139Z @given( 2025-05-07T20:32:12.4375259Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4375369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4375486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4375606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4375731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4375808Z ) 2025-05-07T20:32:12.4376058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4376161Z def test_silu_mul_quant( 2025-05-07T20:32:12.4376236Z self, 2025-05-07T20:32:12.4376314Z T: int, 2025-05-07T20:32:12.4376401Z D: int, 2025-05-07T20:32:12.4376505Z scale_ub: Optional[float], 2025-05-07T20:32:12.4376604Z contiguous: bool, 2025-05-07T20:32:12.4376691Z compiled: bool, 2025-05-07T20:32:12.4376769Z ) -> None: 2025-05-07T20:32:12.4376870Z torch.manual_seed(2025) 2025-05-07T20:32:12.4376942Z 2025-05-07T20:32:12.4377113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4377192Z 2025-05-07T20:32:12.4377286Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4377412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4377513Z x = x_sign * x_clamp 2025-05-07T20:32:12.4377594Z x0 = x[:, :D] 2025-05-07T20:32:12.4377676Z x1 = x[:, D:] 2025-05-07T20:32:12.4377761Z 2025-05-07T20:32:12.4377845Z if contiguous: 2025-05-07T20:32:12.4377939Z x0 = x0.contiguous() 2025-05-07T20:32:12.4378041Z x1 = x1.contiguous() 2025-05-07T20:32:12.4378114Z 2025-05-07T20:32:12.4378212Z if scale_ub is not None: 2025-05-07T20:32:12.4378376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4378512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4378592Z ) 2025-05-07T20:32:12.4378672Z else: 2025-05-07T20:32:12.4378770Z scale_ub_tensor = None 2025-05-07T20:32:12.4378850Z 2025-05-07T20:32:12.4378985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4379076Z op = silu_mul_quant 2025-05-07T20:32:12.4379172Z if compiled: 2025-05-07T20:32:12.4379352Z op = torch.compile(op) 2025-05-07T20:32:12.4379460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4379618Z 2025-05-07T20:32:12.4379715Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4379719Z 2025-05-07T20:32:12.4379960Z moe/activation_test.py:117: 2025-05-07T20:32:12.4380093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4380200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4380304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4380674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4380772Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4381269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4381368Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4381729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4381963Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4382307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4382410Z kernel = self.compile( 2025-05-07T20:32:12.4382800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4382977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4383111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4383115Z 2025-05-07T20:32:12.4383322Z self = 2025-05-07T20:32:12.4384116Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4384623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06abe96b90>} 2025-05-07T20:32:12.4385373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4385568Z context = 2025-05-07T20:32:12.4385573Z 2025-05-07T20:32:12.4385738Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4386005Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4386115Z module_map=module_map) 2025-05-07T20:32:12.4386294Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4386395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4386480Z E ^ 2025-05-07T20:32:12.4386839Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4386843Z 2025-05-07T20:32:12.4387256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4387310Z 2025-05-07T20:32:12.4387417Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4387645Z self=, 2025-05-07T20:32:12.4387726Z T=2048, 2025-05-07T20:32:12.4387813Z D=7168, 2025-05-07T20:32:12.4387899Z scale_ub=1200.0, 2025-05-07T20:32:12.4387988Z contiguous=False, 2025-05-07T20:32:12.4388079Z compiled=False, 2025-05-07T20:32:12.4388149Z ) 2025-05-07T20:32:12.4388409Z self = 2025-05-07T20:32:12.4388667Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.4388672Z 2025-05-07T20:32:12.4388751Z @given( 2025-05-07T20:32:12.4388871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4388979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4389101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4389224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4389343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4389416Z ) 2025-05-07T20:32:12.4389666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4389763Z def test_silu_mul_quant( 2025-05-07T20:32:12.4390135Z self, 2025-05-07T20:32:12.4390262Z T: int, 2025-05-07T20:32:12.4390372Z D: int, 2025-05-07T20:32:12.4390508Z scale_ub: Optional[float], 2025-05-07T20:32:12.4390613Z contiguous: bool, 2025-05-07T20:32:12.4390704Z compiled: bool, 2025-05-07T20:32:12.4390789Z ) -> None: 2025-05-07T20:32:12.4390891Z torch.manual_seed(2025) 2025-05-07T20:32:12.4390964Z 2025-05-07T20:32:12.4391143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4391219Z 2025-05-07T20:32:12.4391316Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4391447Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4391540Z x = x_sign * x_clamp 2025-05-07T20:32:12.4391623Z x0 = x[:, :D] 2025-05-07T20:32:12.4391710Z x1 = x[:, D:] 2025-05-07T20:32:12.4391784Z 2025-05-07T20:32:12.4391869Z if contiguous: 2025-05-07T20:32:12.4391968Z x0 = x0.contiguous() 2025-05-07T20:32:12.4392059Z x1 = x1.contiguous() 2025-05-07T20:32:12.4392133Z 2025-05-07T20:32:12.4392233Z if scale_ub is not None: 2025-05-07T20:32:12.4392342Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4392482Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4392563Z ) 2025-05-07T20:32:12.4392642Z else: 2025-05-07T20:32:12.4392746Z scale_ub_tensor = None 2025-05-07T20:32:12.4392818Z 2025-05-07T20:32:12.4392949Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4393050Z op = silu_mul_quant 2025-05-07T20:32:12.4393136Z if compiled: 2025-05-07T20:32:12.4393237Z op = torch.compile(op) 2025-05-07T20:32:12.4393353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4393426Z 2025-05-07T20:32:12.4393519Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4393529Z 2025-05-07T20:32:12.4393628Z moe/activation_test.py:117: 2025-05-07T20:32:12.4393757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4393867Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4393972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4394477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:12.4394583Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4394948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4395344Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4395698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4395795Z kernel = self.compile( 2025-05-07T20:32:12.4396183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4396360Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4396560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4396564Z 2025-05-07T20:32:12.4396899Z self = 2025-05-07T20:32:12.4397684Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4398203Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5dc5e0>} 2025-05-07T20:32:12.4398942Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4399141Z context = 2025-05-07T20:32:12.4399148Z 2025-05-07T20:32:12.4399318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4399591Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4399707Z module_map=module_map) 2025-05-07T20:32:12.4399870Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4399973Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4400056Z E ^ 2025-05-07T20:32:12.4400414Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4400418Z 2025-05-07T20:32:12.4400837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4400842Z 2025-05-07T20:32:12.4400950Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4401172Z self=, 2025-05-07T20:32:12.4401262Z T=1, 2025-05-07T20:32:12.4401340Z D=7168, 2025-05-07T20:32:12.4401428Z scale_ub=None, 2025-05-07T20:32:12.4401522Z contiguous=True, 2025-05-07T20:32:12.4401607Z compiled=False, 2025-05-07T20:32:12.4401702Z ) 2025-05-07T20:32:12.4401919Z self = 2025-05-07T20:32:12.4402087Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:12.4402091Z 2025-05-07T20:32:12.4402179Z @given( 2025-05-07T20:32:12.4407975Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4408106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4408245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4408369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4408487Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4408577Z ) 2025-05-07T20:32:12.4408835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4408938Z def test_silu_mul_quant( 2025-05-07T20:32:12.4409025Z self, 2025-05-07T20:32:12.4409105Z T: int, 2025-05-07T20:32:12.4409189Z D: int, 2025-05-07T20:32:12.4409291Z scale_ub: Optional[float], 2025-05-07T20:32:12.4409388Z contiguous: bool, 2025-05-07T20:32:12.4409564Z compiled: bool, 2025-05-07T20:32:12.4409646Z ) -> None: 2025-05-07T20:32:12.4409744Z torch.manual_seed(2025) 2025-05-07T20:32:12.4409831Z 2025-05-07T20:32:12.4410007Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4410082Z 2025-05-07T20:32:12.4410186Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4410312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4410404Z x = x_sign * x_clamp 2025-05-07T20:32:12.4410491Z x0 = x[:, :D] 2025-05-07T20:32:12.4410624Z x1 = x[:, D:] 2025-05-07T20:32:12.4410698Z 2025-05-07T20:32:12.4410793Z if contiguous: 2025-05-07T20:32:12.4410967Z x0 = x0.contiguous() 2025-05-07T20:32:12.4411071Z x1 = x1.contiguous() 2025-05-07T20:32:12.4411147Z 2025-05-07T20:32:12.4411240Z if scale_ub is not None: 2025-05-07T20:32:12.4411354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4411494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4411579Z ) 2025-05-07T20:32:12.4411665Z else: 2025-05-07T20:32:12.4411763Z scale_ub_tensor = None 2025-05-07T20:32:12.4411837Z 2025-05-07T20:32:12.4411983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4412077Z op = silu_mul_quant 2025-05-07T20:32:12.4412168Z if compiled: 2025-05-07T20:32:12.4412281Z op = torch.compile(op) 2025-05-07T20:32:12.4412392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4412480Z 2025-05-07T20:32:12.4412576Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4412581Z 2025-05-07T20:32:12.4412691Z moe/activation_test.py:117: 2025-05-07T20:32:12.4412833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4412940Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4413043Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4413558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4413663Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4414032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4414262Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4414605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4414714Z kernel = self.compile( 2025-05-07T20:32:12.4415109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4415288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4415426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4415434Z 2025-05-07T20:32:12.4415642Z self = 2025-05-07T20:32:12.4416425Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4416927Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5dcd30>} 2025-05-07T20:32:12.4417688Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4417884Z context = 2025-05-07T20:32:12.4417888Z 2025-05-07T20:32:12.4418057Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4418419Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4418533Z module_map=module_map) 2025-05-07T20:32:12.4418709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4418813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4418891Z E ^ 2025-05-07T20:32:12.4419256Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4419302Z 2025-05-07T20:32:12.4419919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4419925Z 2025-05-07T20:32:12.4420034Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4420269Z self=, 2025-05-07T20:32:12.4420350Z T=16384, 2025-05-07T20:32:12.4420441Z D=7168, 2025-05-07T20:32:12.4420529Z scale_ub=1200.0, 2025-05-07T20:32:12.4420619Z contiguous=False, 2025-05-07T20:32:12.4420713Z compiled=True, 2025-05-07T20:32:12.4420793Z ) 2025-05-07T20:32:12.4421012Z self = 2025-05-07T20:32:12.4421201Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4421206Z 2025-05-07T20:32:12.4421284Z @given( 2025-05-07T20:32:12.4421406Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4421516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4421635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4421769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4421890Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4421964Z ) 2025-05-07T20:32:12.4422220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4422322Z def test_silu_mul_quant( 2025-05-07T20:32:12.4422399Z self, 2025-05-07T20:32:12.4422487Z T: int, 2025-05-07T20:32:12.4422566Z D: int, 2025-05-07T20:32:12.4422668Z scale_ub: Optional[float], 2025-05-07T20:32:12.4422767Z contiguous: bool, 2025-05-07T20:32:12.4422855Z compiled: bool, 2025-05-07T20:32:12.4422934Z ) -> None: 2025-05-07T20:32:12.4423041Z torch.manual_seed(2025) 2025-05-07T20:32:12.4423114Z 2025-05-07T20:32:12.4423294Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4423373Z 2025-05-07T20:32:12.4423469Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4423610Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4423702Z x = x_sign * x_clamp 2025-05-07T20:32:12.4423784Z x0 = x[:, :D] 2025-05-07T20:32:12.4423876Z x1 = x[:, D:] 2025-05-07T20:32:12.4423947Z 2025-05-07T20:32:12.4424037Z if contiguous: 2025-05-07T20:32:12.4424137Z x0 = x0.contiguous() 2025-05-07T20:32:12.4424233Z x1 = x1.contiguous() 2025-05-07T20:32:12.4424306Z 2025-05-07T20:32:12.4424406Z if scale_ub is not None: 2025-05-07T20:32:12.4424513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4424658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4424737Z ) 2025-05-07T20:32:12.4424816Z else: 2025-05-07T20:32:12.4424921Z scale_ub_tensor = None 2025-05-07T20:32:12.4425000Z 2025-05-07T20:32:12.4425133Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4425231Z op = silu_mul_quant 2025-05-07T20:32:12.4425327Z if compiled: 2025-05-07T20:32:12.4425430Z op = torch.compile(op) 2025-05-07T20:32:12.4425549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4425624Z 2025-05-07T20:32:12.4425720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4425775Z 2025-05-07T20:32:12.4425883Z moe/activation_test.py:117: 2025-05-07T20:32:12.4426013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4426126Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4426230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4426604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4426710Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4427211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4427425Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4427799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4428024Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4428381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4428481Z kernel = self.compile( 2025-05-07T20:32:12.4428862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4429050Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4429176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4429181Z 2025-05-07T20:32:12.4429393Z self = 2025-05-07T20:32:12.4430176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4430676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5ddbd0>} 2025-05-07T20:32:12.4431428Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4431620Z context = 2025-05-07T20:32:12.4431624Z 2025-05-07T20:32:12.4431798Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4432064Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4432177Z module_map=module_map) 2025-05-07T20:32:12.4432347Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4432450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4432526Z E ^ 2025-05-07T20:32:12.4432891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4432896Z 2025-05-07T20:32:12.4433308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4433312Z 2025-05-07T20:32:12.4433423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4433643Z self=, 2025-05-07T20:32:12.4433725Z T=1, 2025-05-07T20:32:12.4433811Z D=7168, 2025-05-07T20:32:12.4433899Z scale_ub=None, 2025-05-07T20:32:12.4433989Z contiguous=False, 2025-05-07T20:32:12.4434085Z compiled=False, 2025-05-07T20:32:12.4434167Z ) 2025-05-07T20:32:12.4434388Z self = 2025-05-07T20:32:12.4434565Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.4434569Z 2025-05-07T20:32:12.4434780Z @given( 2025-05-07T20:32:12.4434911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4435013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4435130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4435256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4435372Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4435449Z ) 2025-05-07T20:32:12.4435703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4435841Z def test_silu_mul_quant( 2025-05-07T20:32:12.4435920Z self, 2025-05-07T20:32:12.4436007Z T: int, 2025-05-07T20:32:12.4436154Z D: int, 2025-05-07T20:32:12.4436267Z scale_ub: Optional[float], 2025-05-07T20:32:12.4436360Z contiguous: bool, 2025-05-07T20:32:12.4436446Z compiled: bool, 2025-05-07T20:32:12.4436534Z ) -> None: 2025-05-07T20:32:12.4436633Z torch.manual_seed(2025) 2025-05-07T20:32:12.4436712Z 2025-05-07T20:32:12.4436890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4436965Z 2025-05-07T20:32:12.4437059Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4437194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4437285Z x = x_sign * x_clamp 2025-05-07T20:32:12.4437366Z x0 = x[:, :D] 2025-05-07T20:32:12.4437453Z x1 = x[:, D:] 2025-05-07T20:32:12.4437526Z 2025-05-07T20:32:12.4437615Z if contiguous: 2025-05-07T20:32:12.4437714Z x0 = x0.contiguous() 2025-05-07T20:32:12.4437804Z x1 = x1.contiguous() 2025-05-07T20:32:12.4437883Z 2025-05-07T20:32:12.4437980Z if scale_ub is not None: 2025-05-07T20:32:12.4438085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4438226Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4438307Z ) 2025-05-07T20:32:12.4438390Z else: 2025-05-07T20:32:12.4438494Z scale_ub_tensor = None 2025-05-07T20:32:12.4438564Z 2025-05-07T20:32:12.4438695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4438794Z op = silu_mul_quant 2025-05-07T20:32:12.4438882Z if compiled: 2025-05-07T20:32:12.4438989Z op = torch.compile(op) 2025-05-07T20:32:12.4439097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4439170Z 2025-05-07T20:32:12.4439267Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4439271Z 2025-05-07T20:32:12.4439375Z moe/activation_test.py:117: 2025-05-07T20:32:12.4439504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4439620Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4439721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4440224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4440333Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4440698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4440928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4441269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4441366Z kernel = self.compile( 2025-05-07T20:32:12.4441758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4441940Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4442075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4442080Z 2025-05-07T20:32:12.4442291Z self = 2025-05-07T20:32:12.4443112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4443617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5de050>} 2025-05-07T20:32:12.4444370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4444714Z context = 2025-05-07T20:32:12.4444719Z 2025-05-07T20:32:12.4444891Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4445153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4445272Z module_map=module_map) 2025-05-07T20:32:12.4445440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4445546Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4445625Z E ^ 2025-05-07T20:32:12.4445983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4445987Z 2025-05-07T20:32:12.4446412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4446419Z 2025-05-07T20:32:12.4446525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4446759Z self=, 2025-05-07T20:32:12.4446840Z T=2048, 2025-05-07T20:32:12.4446917Z D=7168, 2025-05-07T20:32:12.4447008Z scale_ub=None, 2025-05-07T20:32:12.4447096Z contiguous=False, 2025-05-07T20:32:12.4447181Z compiled=True, 2025-05-07T20:32:12.4447260Z ) 2025-05-07T20:32:12.4447480Z self = 2025-05-07T20:32:12.4447656Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4447660Z 2025-05-07T20:32:12.4447743Z @given( 2025-05-07T20:32:12.4447864Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4447966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4448094Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4448216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4448345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4448419Z ) 2025-05-07T20:32:12.4448669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4448771Z def test_silu_mul_quant( 2025-05-07T20:32:12.4448847Z self, 2025-05-07T20:32:12.4448927Z T: int, 2025-05-07T20:32:12.4449010Z D: int, 2025-05-07T20:32:12.4449111Z scale_ub: Optional[float], 2025-05-07T20:32:12.4449203Z contiguous: bool, 2025-05-07T20:32:12.4449299Z compiled: bool, 2025-05-07T20:32:12.4449378Z ) -> None: 2025-05-07T20:32:12.4449474Z torch.manual_seed(2025) 2025-05-07T20:32:12.4449555Z 2025-05-07T20:32:12.4449722Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4449801Z 2025-05-07T20:32:12.4449894Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4450026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4450125Z x = x_sign * x_clamp 2025-05-07T20:32:12.4450209Z x0 = x[:, :D] 2025-05-07T20:32:12.4450296Z x1 = x[:, D:] 2025-05-07T20:32:12.4450374Z 2025-05-07T20:32:12.4450461Z if contiguous: 2025-05-07T20:32:12.4450556Z x0 = x0.contiguous() 2025-05-07T20:32:12.4450653Z x1 = x1.contiguous() 2025-05-07T20:32:12.4450777Z 2025-05-07T20:32:12.4450872Z if scale_ub is not None: 2025-05-07T20:32:12.4450986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4451120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4451201Z ) 2025-05-07T20:32:12.4451276Z else: 2025-05-07T20:32:12.4451375Z scale_ub_tensor = None 2025-05-07T20:32:12.4451455Z 2025-05-07T20:32:12.4451588Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4451681Z op = silu_mul_quant 2025-05-07T20:32:12.4451820Z if compiled: 2025-05-07T20:32:12.4451921Z op = torch.compile(op) 2025-05-07T20:32:12.4452106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4452191Z 2025-05-07T20:32:12.4452284Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4452289Z 2025-05-07T20:32:12.4452389Z moe/activation_test.py:117: 2025-05-07T20:32:12.4452526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4452632Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4452739Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4453105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4453201Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4453711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4453814Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4454178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4454408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4454750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4454856Z kernel = self.compile( 2025-05-07T20:32:12.4455239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4455415Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4455549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4455554Z 2025-05-07T20:32:12.4455761Z self = 2025-05-07T20:32:12.4456545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4457052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06ab5df1c0>} 2025-05-07T20:32:12.4457821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4458012Z context = 2025-05-07T20:32:12.4458017Z 2025-05-07T20:32:12.4458185Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4458455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4458567Z module_map=module_map) 2025-05-07T20:32:12.4458737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4458845Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4458924Z E ^ 2025-05-07T20:32:12.4459290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4459338Z 2025-05-07T20:32:12.4459750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4459755Z 2025-05-07T20:32:12.4459962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4460192Z self=, 2025-05-07T20:32:12.4460275Z T=4096, 2025-05-07T20:32:12.4460357Z D=7168, 2025-05-07T20:32:12.4460439Z scale_ub=None, 2025-05-07T20:32:12.4460527Z contiguous=False, 2025-05-07T20:32:12.4460618Z compiled=True, 2025-05-07T20:32:12.4460737Z ) 2025-05-07T20:32:12.4460959Z self = 2025-05-07T20:32:12.4461216Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4461221Z 2025-05-07T20:32:12.4461301Z @given( 2025-05-07T20:32:12.4461421Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4461531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4461649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4461774Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4461894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4461969Z ) 2025-05-07T20:32:12.4462221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4462316Z def test_silu_mul_quant( 2025-05-07T20:32:12.4462393Z self, 2025-05-07T20:32:12.4462480Z T: int, 2025-05-07T20:32:12.4462558Z D: int, 2025-05-07T20:32:12.4462662Z scale_ub: Optional[float], 2025-05-07T20:32:12.4462760Z contiguous: bool, 2025-05-07T20:32:12.4462853Z compiled: bool, 2025-05-07T20:32:12.4462931Z ) -> None: 2025-05-07T20:32:12.4463037Z torch.manual_seed(2025) 2025-05-07T20:32:12.4463111Z 2025-05-07T20:32:12.4463281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4463364Z 2025-05-07T20:32:12.4463461Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4463597Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4463690Z x = x_sign * x_clamp 2025-05-07T20:32:12.4463773Z x0 = x[:, :D] 2025-05-07T20:32:12.4463863Z x1 = x[:, D:] 2025-05-07T20:32:12.4463937Z 2025-05-07T20:32:12.4464023Z if contiguous: 2025-05-07T20:32:12.4464127Z x0 = x0.contiguous() 2025-05-07T20:32:12.4464220Z x1 = x1.contiguous() 2025-05-07T20:32:12.4464294Z 2025-05-07T20:32:12.4464398Z if scale_ub is not None: 2025-05-07T20:32:12.4464504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4464648Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4464732Z ) 2025-05-07T20:32:12.4464813Z else: 2025-05-07T20:32:12.4464919Z scale_ub_tensor = None 2025-05-07T20:32:12.4464995Z 2025-05-07T20:32:12.4465130Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4465231Z op = silu_mul_quant 2025-05-07T20:32:12.4465318Z if compiled: 2025-05-07T20:32:12.4465423Z op = torch.compile(op) 2025-05-07T20:32:12.4465539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4465610Z 2025-05-07T20:32:12.4465704Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4465709Z 2025-05-07T20:32:12.4465816Z moe/activation_test.py:117: 2025-05-07T20:32:12.4465944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4466074Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4466177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4466548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4466652Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function>, 'min_dot_size': <function ... at 0x7f06aaf301f0>}
module_map = {'triton.language.extra.libdevice': <module>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
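Every one of these failures is the same compile-time rejection: fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada, Hopper) onward, while the A10G in a linux.g5.4xlarge.nvidia.gpu runner reports capability 8.6. Below is a minimal sketch of a guard a test could use to skip the fp8 path on such GPUs; the helper name supports_fp8e4nv is ours, not part of FBGEMM or this test file:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Best-effort check: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on
    # devices with compute capability >= (8, 9), i.e. Ada and Hopper. The
    # A10G on g5 instances reports (8, 6), so the cast is rejected there.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the failing test:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
# def test_silu_mul_quant(self, ...) -> None: ...

A guard like this would turn the repeated CompilationError below into a single skip on pre-sm_89 runners, without changing behavior on hardware that does support fp8.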
Hypothesis then tried ten further examples. Each one failed at kernel compilation with the identical error, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
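The rejection is independent of the FBGEMM kernel itself: Triton raises it for any cast to fp8e4nv during lowering, whether or not torch.compile is in the stack. A standalone sketch (our own construction, with a hypothetical kernel name, not taken from this repository) that should reproduce the identical ValueError on a pre-sm_89 GPU and pass on Ada or Hopper:

import torch
import triton
import triton.language as tl


@triton.jit
def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # This cast is the operation _fbgemm_silu_mul_quant ultimately needs;
    # on GPUs older than sm_89, Triton rejects it at compile time with
    # "type fp8e4nv not supported in this architecture".
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)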
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4628370Z 2025-05-07T20:32:12.4628785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4628792Z 2025-05-07T20:32:12.4628907Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4629131Z self=, 2025-05-07T20:32:12.4629221Z T=128, 2025-05-07T20:32:12.4629304Z D=7168, 2025-05-07T20:32:12.4629391Z scale_ub=1200.0, 2025-05-07T20:32:12.4629488Z contiguous=False, 2025-05-07T20:32:12.4629575Z compiled=True, 2025-05-07T20:32:12.4629652Z ) 2025-05-07T20:32:12.4629876Z self = 2025-05-07T20:32:12.4630055Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4630060Z 2025-05-07T20:32:12.4630142Z @given( 2025-05-07T20:32:12.4630276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4630379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4630501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4630628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4630750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4630833Z ) 2025-05-07T20:32:12.4631081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4631178Z def test_silu_mul_quant( 2025-05-07T20:32:12.4631263Z self, 2025-05-07T20:32:12.4631343Z T: int, 2025-05-07T20:32:12.4631423Z D: int, 2025-05-07T20:32:12.4631533Z scale_ub: Optional[float], 2025-05-07T20:32:12.4631627Z contiguous: bool, 2025-05-07T20:32:12.4631719Z compiled: bool, 2025-05-07T20:32:12.4631808Z ) -> None: 2025-05-07T20:32:12.4631907Z torch.manual_seed(2025) 2025-05-07T20:32:12.4631988Z 2025-05-07T20:32:12.4632164Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4632241Z 2025-05-07T20:32:12.4632342Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4632469Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4632617Z x = x_sign * x_clamp 2025-05-07T20:32:12.4632707Z x0 = x[:, :D] 2025-05-07T20:32:12.4632791Z x1 = x[:, D:] 2025-05-07T20:32:12.4632865Z 2025-05-07T20:32:12.4632957Z if contiguous: 2025-05-07T20:32:12.4633054Z x0 = x0.contiguous() 2025-05-07T20:32:12.4633147Z x1 = x1.contiguous() 2025-05-07T20:32:12.4633230Z 2025-05-07T20:32:12.4633325Z if scale_ub is not None: 2025-05-07T20:32:12.4633434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4633582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4633708Z ) 2025-05-07T20:32:12.4633794Z else: 2025-05-07T20:32:12.4633966Z scale_ub_tensor = None 2025-05-07T20:32:12.4634043Z 2025-05-07T20:32:12.4634185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4634279Z op = silu_mul_quant 2025-05-07T20:32:12.4634368Z if compiled: 2025-05-07T20:32:12.4634481Z op = torch.compile(op) 2025-05-07T20:32:12.4634592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4634668Z 2025-05-07T20:32:12.4634770Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4634774Z 2025-05-07T20:32:12.4634876Z moe/activation_test.py:117: 2025-05-07T20:32:12.4635012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4635116Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4635218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4635601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4635705Z return fn(*args, **kwargs) 
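All of the CompilationError failures in this run have a single root cause: the job runs on a g5.4xlarge whose NVIDIA A10G GPU is compute capability sm_86, and this Triton build only accepts the fp8e4nv (float8_e4m3fn) element type on newer architectures, which is why the ValueError lists only 'fp8e4b15' and 'fp8e5' as supported. A minimal guard sketch, assuming sm_89 (Ada) is the first architecture this Triton build accepts fp8e4nv on; the helper name is hypothetical and not part of the test file:

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv conversions need compute capability >= 8.9;
    # the A10G on this runner reports (8, 6), matching the ValueError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, e.g.:
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")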
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the first example above]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the first example above]
2025-05-07T20:32:12.4655956Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4656186Z     self=,
2025-05-07T20:32:12.4656269Z     T=16384,
2025-05-07T20:32:12.4656348Z     D=5120,
2025-05-07T20:32:12.4656441Z     scale_ub=None,
2025-05-07T20:32:12.4656532Z     contiguous=False,
2025-05-07T20:32:12.4656621Z     compiled=False,
2025-05-07T20:32:12.4656708Z )
2025-05-07T20:32:12.4656926Z self = 
2025-05-07T20:32:12.4657117Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:12.4657126Z 
      [same test source as in the first example above; the failing tail:]
2025-05-07T20:32:12.4659389Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:12.4659472Z 
2025-05-07T20:32:12.4659569Z         x_sign = torch.sign(x)
2025-05-07T20:32:12.4659702Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:12.4661644Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:12.4661654Z 
2025-05-07T20:32:12.4661791Z moe/activation_test.py:95: OutOfMemoryError
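The OutOfMemoryError sizes line up exactly with the tensors the failing lines try to materialize: for T=16384 and D=5120, x is a [16384, 10240] bfloat16 tensor, and each elementwise intermediate (torch.abs(x), x_clamp, and so on) is another tensor of the same size. A quick check of the arithmetic:

# One [T, 2*D] bfloat16 tensor at 2 bytes per element.
T, D = 16384, 5120
size_mib = T * (2 * D) * 2 / 2**20
print(size_mib)  # 320.0 -> matches "Tried to allocate 320.00 MiB" above

The request itself is small next to the card's 22.07 GiB; these examples fail because roughly 21.5 to 21.7 GiB is still allocated by PyTorch, presumably accumulated across the earlier Hypothesis examples in this same process, so by this point even 40 MiB allocations cannot be satisfied.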
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free; 21.61 GiB allocated by PyTorch) at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)  [moe/activation_test.py:95]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch) at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)  [moe/activation_test.py:95]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch) at x_sign = torch.sign(x)  [moe/activation_test.py:94]
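The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for the fragmentation case; independent of that, releasing cached blocks between Hypothesis examples would keep one example's intermediates from starving the next. A minimal sketch, not taken from the test file; where exactly the hook runs (for instance a unittest tearDown between examples) is an assumption:

import gc
import os

# Must be set before the first CUDA allocation to take effect
# (the setting is quoted from the OutOfMemoryError hint above).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver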
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the first example above, minus the torch._dynamo frame, since compiled=False calls silu_mul_quant directly]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the compiled=False example above]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the compiled=False example above]
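Note that the compiled=False examples fail identically, so torch.compile is not the trigger: silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel directly, and the error is raised while Triton lowers it to TTIR (make_ir -> ast_to_ttir). A minimal eager-mode reproduction sketch, assuming the fbgemm_gpu gen_ai wheel is importable exactly as in the traceback:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
# On this sm_86 GPU the call raises the CompilationError above during
# Triton compilation, before any kernel actually runs.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)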
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
      [source listing and traceback identical to the compiled=False examples above]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x_sign = torch.sign(x)  [moe/activation_test.py:94]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch) at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)  [moe/activation_test.py:92]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
      [fails at x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)]
2025-05-07T20:32:12.4783829Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4783836Z 2025-05-07T20:32:12.4783951Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4783959Z 2025-05-07T20:32:12.4784065Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4784287Z self=, 2025-05-07T20:32:12.4784368Z T=4096, 2025-05-07T20:32:12.4784441Z D=7168, 2025-05-07T20:32:12.4784521Z scale_ub=1200.0, 2025-05-07T20:32:12.4784614Z contiguous=True, 2025-05-07T20:32:12.4784697Z compiled=False, 2025-05-07T20:32:12.4784819Z ) 2025-05-07T20:32:12.4785042Z self = 2025-05-07T20:32:12.4785213Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.4785218Z 2025-05-07T20:32:12.4785294Z @given( 2025-05-07T20:32:12.4785414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4785511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4785630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4785745Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4785902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4785984Z ) 2025-05-07T20:32:12.4786265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4786358Z def test_silu_mul_quant( 2025-05-07T20:32:12.4786437Z self, 2025-05-07T20:32:12.4786511Z T: int, 2025-05-07T20:32:12.4786589Z D: int, 2025-05-07T20:32:12.4786700Z scale_ub: Optional[float], 2025-05-07T20:32:12.4786787Z contiguous: bool, 2025-05-07T20:32:12.4786872Z compiled: bool, 2025-05-07T20:32:12.4786956Z ) -> None: 2025-05-07T20:32:12.4787049Z torch.manual_seed(2025) 2025-05-07T20:32:12.4787128Z 2025-05-07T20:32:12.4787293Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4789064Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4789080Z 2025-05-07T20:32:12.4789241Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4789246Z 2025-05-07T20:32:12.4789352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4789575Z self=, 2025-05-07T20:32:12.4789652Z T=16384, 2025-05-07T20:32:12.4789730Z D=7168, 2025-05-07T20:32:12.4790120Z scale_ub=None, 2025-05-07T20:32:12.4790252Z contiguous=False, 2025-05-07T20:32:12.4790336Z compiled=True, 2025-05-07T20:32:12.4790416Z ) 2025-05-07T20:32:12.4790635Z self = 2025-05-07T20:32:12.4790822Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:12.4790826Z 2025-05-07T20:32:12.4790902Z @given( 2025-05-07T20:32:12.4791017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4791124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4791242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4791360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4791477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4791549Z ) 2025-05-07T20:32:12.4791790Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4791889Z def test_silu_mul_quant( 2025-05-07T20:32:12.4791967Z self, 2025-05-07T20:32:12.4792047Z T: int, 2025-05-07T20:32:12.4792123Z D: int, 2025-05-07T20:32:12.4792224Z scale_ub: Optional[float], 2025-05-07T20:32:12.4792320Z contiguous: bool, 2025-05-07T20:32:12.4792404Z compiled: bool, 2025-05-07T20:32:12.4792484Z ) -> None: 2025-05-07T20:32:12.4792586Z torch.manual_seed(2025) 2025-05-07T20:32:12.4792657Z 2025-05-07T20:32:12.4792822Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4794681Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4794827Z 2025-05-07T20:32:12.4794948Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4794958Z 2025-05-07T20:32:12.4795139Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4795360Z self=, 2025-05-07T20:32:12.4795444Z T=4096, 2025-05-07T20:32:12.4795521Z D=7168, 2025-05-07T20:32:12.4795604Z scale_ub=None, 2025-05-07T20:32:12.4795698Z contiguous=True, 2025-05-07T20:32:12.4795788Z compiled=False, 2025-05-07T20:32:12.4795863Z ) 2025-05-07T20:32:12.4796084Z self = 2025-05-07T20:32:12.4796254Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:12.4796259Z 2025-05-07T20:32:12.4801857Z @given( 2025-05-07T20:32:12.4802019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4802126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4802251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4802392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4802514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4802602Z ) 2025-05-07T20:32:12.4802855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4802959Z def test_silu_mul_quant( 2025-05-07T20:32:12.4803049Z self, 2025-05-07T20:32:12.4803136Z T: int, 2025-05-07T20:32:12.4803322Z D: int, 2025-05-07T20:32:12.4803437Z scale_ub: Optional[float], 2025-05-07T20:32:12.4803532Z contiguous: bool, 2025-05-07T20:32:12.4803622Z compiled: bool, 2025-05-07T20:32:12.4803715Z ) -> None: 2025-05-07T20:32:12.4803815Z torch.manual_seed(2025) 2025-05-07T20:32:12.4803892Z 2025-05-07T20:32:12.4804072Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4805863Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4805881Z 2025-05-07T20:32:12.4806006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4806011Z 2025-05-07T20:32:12.4806118Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4806351Z self=, 2025-05-07T20:32:12.4806432Z T=16384, 2025-05-07T20:32:12.4806512Z D=7168, 2025-05-07T20:32:12.4806605Z scale_ub=None, 2025-05-07T20:32:12.4806695Z contiguous=True, 2025-05-07T20:32:12.4806786Z compiled=False, 2025-05-07T20:32:12.4806870Z ) 2025-05-07T20:32:12.4807091Z self = 2025-05-07T20:32:12.4807272Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:12.4807283Z 2025-05-07T20:32:12.4807362Z @given( 2025-05-07T20:32:12.4807485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4807652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4807771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4807892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4808016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4808094Z ) 2025-05-07T20:32:12.4808343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4808449Z def test_silu_mul_quant( 2025-05-07T20:32:12.4808529Z self, 2025-05-07T20:32:12.4808610Z T: int, 2025-05-07T20:32:12.4808739Z D: int, 2025-05-07T20:32:12.4808844Z scale_ub: Optional[float], 2025-05-07T20:32:12.4808983Z contiguous: bool, 2025-05-07T20:32:12.4809074Z compiled: bool, 2025-05-07T20:32:12.4809154Z ) -> None: 2025-05-07T20:32:12.4809262Z torch.manual_seed(2025) 2025-05-07T20:32:12.4809339Z 2025-05-07T20:32:12.4809507Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4811290Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.4811298Z 2025-05-07T20:32:12.4811417Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:12.4811425Z 2025-05-07T20:32:12.4811537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4811758Z self=, 2025-05-07T20:32:12.4811844Z T=16384, 2025-05-07T20:32:12.4811920Z D=7168, 2025-05-07T20:32:12.4812052Z scale_ub=1200.0, 2025-05-07T20:32:12.4812146Z contiguous=True, 2025-05-07T20:32:12.4812232Z compiled=False, 2025-05-07T20:32:12.4812311Z ) 2025-05-07T20:32:12.4812530Z self = 2025-05-07T20:32:12.4812711Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.4812715Z 2025-05-07T20:32:12.4812795Z @given( 2025-05-07T20:32:12.4812919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4813022Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4813142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4813274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4813390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4813469Z ) 2025-05-07T20:32:12.4813714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4813811Z def test_silu_mul_quant( 2025-05-07T20:32:12.4813896Z self, 2025-05-07T20:32:12.4813974Z T: int, 2025-05-07T20:32:12.4814051Z D: int, 2025-05-07T20:32:12.4814161Z scale_ub: Optional[float], 2025-05-07T20:32:12.4814257Z contiguous: bool, 2025-05-07T20:32:12.4814343Z compiled: bool, 2025-05-07T20:32:12.4814432Z ) -> None: 2025-05-07T20:32:12.4814527Z torch.manual_seed(2025) 2025-05-07T20:32:12.4814599Z 2025-05-07T20:32:12.4814772Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4816584Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
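The allocator hint repeated in these errors is actionable: PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation. A minimal sketch of the in-process equivalent, assuming the test job can set the variable before torch starts; note that with only 26.44 MiB free the device here is essentially full, so expandable segments would address the fragmentation component only, not genuine exhaustion:

    import os

    # Must be set before torch initializes its CUDA caching allocator,
    # i.e. before the first CUDA tensor is created (safest: before importing torch).
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402

    # With expandable segments, reserved-but-unallocated memory (19.12 MiB above)
    # can be grown into rather than stranded; a fresh 448 MiB request on a truly
    # full 22 GiB device will still fail.
    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)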
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f06aa858940>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
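This CompilationError is an architecture limit rather than a test bug: Triton's fp8e4nv corresponds to float8_e4m3fn, which to my knowledge requires compute capability 8.9 or newer (Ada/Hopper), while the g5 runner's A10G reports SM 8.6, matching the "supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" message. A hedged sketch of a capability guard, assuming unittest-style tests like ActivationTests (the class name Fp8ActivationTests below is hypothetical):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) Triton kernels need compute capability >= (8, 9).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class Fp8ActivationTests(unittest.TestCase):
        ...

With such a guard the fp8 cases would be reported as skipped on this runner instead of failing the whole job.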
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(the Triton jit/compile frames are identical to the traceback above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:32:12.4867405Z 2025-05-07T20:32:12.4867664Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:12.4867869Z ================= 1 failed, 1 deselected, 3 warnings in 21.69s ================= 2025-05-07T20:32:14.1714508Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:14.2345037Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:14.2345361Z 2025-05-07T20:32:16.2361049Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:18.3766558Z ============================= test session starts ============================== 2025-05-07T20:32:18.3767180Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:18.3767724Z cachedir: .pytest_cache 2025-05-07T20:32:18.3768306Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:18.3769024Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:18.3769433Z plugins: hypothesis-6.131.14 2025-05-07T20:32:19.9752452Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:20.1531998Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:20.1532396Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:20.1532628Z 2025-05-07T20:32:22.6557800Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6558634Z self=, 2025-05-07T20:32:22.6559043Z T=1, 2025-05-07T20:32:22.6559243Z D=5120, 2025-05-07T20:32:22.6559462Z scale_ub=None, 2025-05-07T20:32:22.6559678Z contiguous=True, 2025-05-07T20:32:22.6559908Z compiled=True, 2025-05-07T20:32:22.6560110Z ) 2025-05-07T20:32:22.6560439Z self = 2025-05-07T20:32:22.6560930Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.6561191Z 2025-05-07T20:32:22.6561273Z @given( 2025-05-07T20:32:22.6561516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6561829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6562136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6562460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6562792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6563083Z ) 2025-05-07T20:32:22.6563435Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6563882Z def test_silu_mul_quant( 2025-05-07T20:32:22.6564138Z self, 2025-05-07T20:32:22.6564337Z T: int, 2025-05-07T20:32:22.6564539Z D: int, 2025-05-07T20:32:22.6564770Z scale_ub: Optional[float], 2025-05-07T20:32:22.6565045Z contiguous: bool, 2025-05-07T20:32:22.6565368Z compiled: bool, 2025-05-07T20:32:22.6565672Z ) -> None: 2025-05-07T20:32:22.6565895Z torch.manual_seed(2025) 2025-05-07T20:32:22.6566508Z 2025-05-07T20:32:22.6566789Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6567136Z 2025-05-07T20:32:22.6567328Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6567622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:22.6567934Z x = x_sign * x_clamp 2025-05-07T20:32:22.6568171Z x0 = x[:, :D] 2025-05-07T20:32:22.6568390Z x1 = x[:, D:] 2025-05-07T20:32:22.6568602Z 2025-05-07T20:32:22.6568788Z if contiguous: 2025-05-07T20:32:22.6569023Z x0 = x0.contiguous() 2025-05-07T20:32:22.6569385Z x1 = x1.contiguous() 2025-05-07T20:32:22.6569622Z 2025-05-07T20:32:22.6569910Z if scale_ub is not None: 2025-05-07T20:32:22.6570190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6570524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6570827Z ) 2025-05-07T20:32:22.6571023Z else: 2025-05-07T20:32:22.6571242Z scale_ub_tensor = None 2025-05-07T20:32:22.6571490Z 2025-05-07T20:32:22.6571725Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6572041Z op = silu_mul_quant 2025-05-07T20:32:22.6572285Z if compiled: 2025-05-07T20:32:22.6572537Z op = torch.compile(op) 2025-05-07T20:32:22.6572837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6573104Z 2025-05-07T20:32:22.6573303Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.6573590Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.6573873Z 2025-05-07T20:32:22.6574109Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6574454Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.6574741Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.6575060Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.6575417Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.6575813Z 2025-05-07T20:32:22.6576018Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:22.6576219Z 2025-05-07T20:32:22.6576320Z moe/activation_test.py:126: 2025-05-07T20:32:22.6576625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6576951Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.6577282Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.6578079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.6578834Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.6579382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6580218Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6580912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.6581642Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.6582384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.6583128Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.6583854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.6584495Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.6585096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.6585615Z fn() 2025-05-07T20:32:22.6586122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.6586764Z self.fn.run( 
2025-05-07T20:32:22.6587234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6587772Z kernel = self.compile( 2025-05-07T20:32:22.6588304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6588956Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6589350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6589705Z 2025-05-07T20:32:22.6590330Z self = 2025-05-07T20:32:22.6591416Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6592802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09d57caf0>} 2025-05-07T20:32:22.6594131Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6595161Z context = 2025-05-07T20:32:22.6595449Z 2025-05-07T20:32:22.6595623Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6596136Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6596598Z module_map=module_map) 2025-05-07T20:32:22.6596967Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6597316Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.6597657Z E ^ 2025-05-07T20:32:22.6598122Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6598564Z 2025-05-07T20:32:22.6598984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.6599487Z 2025-05-07T20:32:22.6599594Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6600003Z self=, 2025-05-07T20:32:22.6600405Z T=2048, 2025-05-07T20:32:22.6600591Z D=5120, 2025-05-07T20:32:22.6600786Z scale_ub=1200.0, 2025-05-07T20:32:22.6601013Z contiguous=True, 2025-05-07T20:32:22.6601229Z compiled=False, 2025-05-07T20:32:22.6601441Z ) 2025-05-07T20:32:24.0065886Z self = 2025-05-07T20:32:24.0066485Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.0066803Z 2025-05-07T20:32:24.0066886Z @given( 2025-05-07T20:32:24.0067135Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.0067466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.0067783Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.0076189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.0076586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.0076877Z ) 2025-05-07T20:32:24.0077245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.0077698Z def test_silu_mul_quant( 2025-05-07T20:32:24.0077949Z self, 2025-05-07T20:32:24.0078157Z T: int, 2025-05-07T20:32:24.0078365Z D: int, 2025-05-07T20:32:24.0078594Z scale_ub: Optional[float], 2025-05-07T20:32:24.0078873Z contiguous: bool, 2025-05-07T20:32:24.0079120Z compiled: bool, 2025-05-07T20:32:24.0079593Z ) -> None: 2025-05-07T20:32:24.0079831Z torch.manual_seed(2025) 2025-05-07T20:32:24.0080080Z 2025-05-07T20:32:24.0080356Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.0080712Z 
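For reference, the math under test is compact: y = SiLU(x0) * x1 followed by row-wise FP8 quantization. Below is a hedged eager-mode restatement of the ref_fn path shown above, assuming torch.float8_e4m3fn is available (PyTorch >= 2.1) and the common rowwise convention scale = row_max / fp8_max; the helper name silu_mul_quant_ref is hypothetical, and since it avoids Triton it also runs on GPUs without fp8e4nv kernels:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)                            # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)           # optional upper bound
        fp8_max = torch.finfo(torch.float8_e4m3fn).max           # 448.0 for e4m3fn
        y_scale = (row_max / fp8_max).clamp(min=1e-12)           # avoid divide-by-zero
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None] mirrors the check the test performs on the Triton output.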
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(the Triton jit/compile frames are identical to the tracebacks above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:24.0117913Z                 op = torch.compile(op)
2025-05-07T20:32:24.0118215Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:24.0118492Z 
2025-05-07T20:32:24.0118693Z         y_fp8, y_scale = fn()
2025-05-07T20:32:24.0118981Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:24.0119340Z 
2025-05-07T20:32:24.0119589Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:24.0119930Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:24.0120223Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:24.0120544Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:24.0120907Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:24.0121214Z 
2025-05-07T20:32:24.0121424Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:24.0121676Z 
2025-05-07T20:32:24.0121779Z moe/activation_test.py:126: 
2025-05-07T20:32:24.0122126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:24.0122460Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:24.0122793Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:24.0123590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:24.0124337Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:24.0124888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:24.0125567Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:24.0126251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:24.0126970Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:24.0127758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:24.0128526Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:24.0129296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:24.0129943Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:24.0130543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:24.0131064Z     fn()
2025-05-07T20:32:24.0131566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:24.0132146Z     self.fn.run(
2025-05-07T20:32:24.0132618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:24.0133160Z     kernel = self.compile(
2025-05-07T20:32:24.0133699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:24.0134359Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:24.0134757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:24.0134986Z 
2025-05-07T20:32:24.0135197Z self = <...>
2025-05-07T20:32:24.0136270Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:24.0137630Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fd097e2d3f0>}
2025-05-07T20:32:24.0138966Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:24.0140045Z context = <...>
2025-05-07T20:32:24.0140378Z 
2025-05-07T20:32:24.0140548Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:24.0141076Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:24.0141544Z                            module_map=module_map)
2025-05-07T20:32:24.0141913Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:24.0142270Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:24.0142542Z E       ^
2025-05-07T20:32:24.0143009Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:24.0143506Z 
2025-05-07T20:32:24.0143983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
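[Editor's note: every failure in this section has the same root cause. Triton's fp8e4nv type (float8 e4m3) is accepted by its NVIDIA backend only on compute capability 8.9 and newer (Ada/Hopper); the linux.g5.4xlarge runner's A10G is sm_86, where only fp8e4b15 and fp8e5 are offered, so every kernel touching fp8e4nv dies in make_ir as above. A minimal sketch of a capability guard a test like this could use — the helper name is hypothetical, not part of FBGEMM or this log:

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to float8_e4m3fn, which Triton's NVIDIA
        # backend accepts only on compute capability >= (8, 9), i.e. sm_89+.
        # The A10G running this job is sm_86, hence the CompilationError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...

With such a guard the job would report these cases as skipped instead of re-compiling and failing the same kernels for every Hypothesis example.]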
[Editor's note: Hypothesis retried the identical test body for each example below, and every attempt failed at Triton compile time with the same error — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The verbatim-duplicate source listings and tracebacks are elided; each entry keeps the example's parameters and the call path that raised the CompilationError.]
2025-05-07T20:32:24.0144620Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:25.2116232Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
2025-05-07T20:32:25.2156270Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:26.7847492Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:26.7878429Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
2025-05-07T20:32:26.8544122Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:27.2210799Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:27.2247366Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
2025-05-07T20:32:27.8043659Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
2025-05-07T20:32:28.3447908Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
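[Editor's note: Hypothesis keeps drawing new parameter combinations even though each one fails identically at compile time, which is why the same source listing and traceback recur for minutes of log. For local debugging one could pin the known-bad parameters with an explicit example, which Hypothesis always runs in addition to (and before) any generated ones; a sketch using the standard decorators, with the test body elided:

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  # first failure above
    @settings(verbosity=Verbosity.verbose, max_examples=1, deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body as in the test shown earlier

]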
2025-05-07T20:32:29.2481116Z op = silu_mul_quant 2025-05-07T20:32:29.2481361Z if compiled: 2025-05-07T20:32:29.2481682Z op = torch.compile(op) 2025-05-07T20:32:29.2481980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.2482254Z 2025-05-07T20:32:29.2482444Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.2482729Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.2483019Z 2025-05-07T20:32:29.2483252Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.2483585Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.2483878Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.2484259Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.2484619Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.2484983Z 2025-05-07T20:32:29.2485179Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:29.2485381Z 2025-05-07T20:32:29.2485482Z moe/activation_test.py:126: 2025-05-07T20:32:29.2485786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.2486123Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.2486444Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.2487231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.2487995Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.2488540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.2489222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.2490189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.2490930Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.2491754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:29.2492666Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.2493548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.2494318Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.2495031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.2495654Z fn() 2025-05-07T20:32:29.2496262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.2496955Z self.fn.run( 2025-05-07T20:32:29.2497507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.2498140Z kernel = self.compile( 2025-05-07T20:32:29.2498787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.2499568Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.2500073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.2500297Z 2025-05-07T20:32:29.2500510Z self = 2025-05-07T20:32:29.2501583Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.2503187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09694c700>} 2025-05-07T20:32:29.2504516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.2505619Z context = 2025-05-07T20:32:29.2505910Z 2025-05-07T20:32:29.2506077Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.2506597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.2507055Z module_map=module_map) 2025-05-07T20:32:29.2507486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.2507839Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.2508151Z E ^ 2025-05-07T20:32:29.2508613Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.2509070Z 2025-05-07T20:32:29.2509484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.2509990Z 2025-05-07T20:32:29.2510101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.2510505Z self=, 2025-05-07T20:32:29.2510903Z T=4096, 2025-05-07T20:32:29.2511089Z D=5120, 2025-05-07T20:32:29.2511276Z scale_ub=None, 2025-05-07T20:32:29.2511488Z contiguous=True, 2025-05-07T20:32:29.2511709Z compiled=True, 2025-05-07T20:32:29.2511903Z ) 2025-05-07T20:32:29.9851149Z self = 2025-05-07T20:32:29.9851978Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.9852377Z 2025-05-07T20:32:29.9852502Z @given( 2025-05-07T20:32:29.9852846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.9853316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.9853772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.9854443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.9854946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.9855380Z ) 2025-05-07T20:32:29.9855902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.9856567Z def test_silu_mul_quant( 2025-05-07T20:32:29.9856931Z self, 2025-05-07T20:32:29.9857220Z T: int, 2025-05-07T20:32:29.9857520Z D: int, 2025-05-07T20:32:29.9857860Z scale_ub: Optional[float], 2025-05-07T20:32:29.9858267Z contiguous: bool, 2025-05-07T20:32:29.9858626Z compiled: bool, 2025-05-07T20:32:29.9858978Z ) -> None: 2025-05-07T20:32:29.9859302Z torch.manual_seed(2025) 2025-05-07T20:32:29.9859581Z 2025-05-07T20:32:29.9859983Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.9860334Z 2025-05-07T20:32:29.9860538Z x_sign = torch.sign(x) 2025-05-07T20:32:29.9867088Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.9867412Z x = x_sign * x_clamp 2025-05-07T20:32:29.9867651Z x0 = x[:, :D] 2025-05-07T20:32:29.9867866Z x1 = x[:, D:] 2025-05-07T20:32:29.9868077Z 2025-05-07T20:32:29.9868257Z if contiguous: 2025-05-07T20:32:29.9868495Z x0 = x0.contiguous() 2025-05-07T20:32:29.9868757Z x1 = x1.contiguous() 2025-05-07T20:32:29.9868994Z 2025-05-07T20:32:29.9869187Z if scale_ub is not None: 2025-05-07T20:32:29.9869495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.9869858Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.9870166Z ) 2025-05-07T20:32:29.9870361Z else: 2025-05-07T20:32:29.9870580Z scale_ub_tensor 
= None 2025-05-07T20:32:29.9870828Z 2025-05-07T20:32:29.9871067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.9871494Z op = silu_mul_quant 2025-05-07T20:32:29.9871748Z if compiled: 2025-05-07T20:32:29.9872009Z op = torch.compile(op) 2025-05-07T20:32:29.9872313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.9872584Z 2025-05-07T20:32:29.9872783Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.9873072Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.9873357Z 2025-05-07T20:32:29.9873601Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.9873937Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.9874304Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.9874678Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.9875046Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.9875356Z 2025-05-07T20:32:29.9875571Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:29.9875775Z 2025-05-07T20:32:29.9875884Z moe/activation_test.py:126: 2025-05-07T20:32:29.9876193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.9876528Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.9876846Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.9877644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.9878397Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.9878942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.9879621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.9880306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.9881028Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.9881820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:29.9882579Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.9883302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.9883935Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.9884524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.9885043Z fn() 2025-05-07T20:32:29.9885558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.9886134Z self.fn.run( 2025-05-07T20:32:29.9886591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.9887122Z kernel = self.compile( 2025-05-07T20:32:29.9887662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.9888308Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.9888702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.9888932Z 2025-05-07T20:32:29.9889141Z self = 2025-05-07T20:32:29.9890444Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.9891810Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096894280>} 2025-05-07T20:32:29.9893225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.9894255Z context = 2025-05-07T20:32:29.9894539Z 2025-05-07T20:32:29.9894711Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.9895233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.9895760Z module_map=module_map) 2025-05-07T20:32:29.9896189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.9896550Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.9896813Z E ^ 2025-05-07T20:32:29.9897275Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.9897722Z 2025-05-07T20:32:29.9898137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.9898655Z 2025-05-07T20:32:29.9898761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.9899172Z self=, 2025-05-07T20:32:29.9899591Z T=16384, 2025-05-07T20:32:29.9899883Z D=5120, 2025-05-07T20:32:29.9900084Z scale_ub=None, 2025-05-07T20:32:29.9900294Z contiguous=True, 2025-05-07T20:32:29.9900523Z compiled=True, 2025-05-07T20:32:29.9900727Z ) 2025-05-07T20:32:30.0284479Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:30.0286461Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:30.0288439Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:30.0289672Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:30.0290914Z W0507 20:32:30.026000 87987 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
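Analysis of the recompile warning above (annotation, not part of the captured output): torch._dynamo guards compiled graphs on tensor strides as well as shapes. The test's `contiguous` parameter alternates between sliced views (`x0 = x[:, :D]`, which keeps the parent row stride of 2*D) and `.contiguous()` copies (row stride D), so each flip invalidates the stride guard on `silu_mul_quant` and forces a recompile; after `config.recompile_limit` (8) recompiles, Dynamo falls back to eager, which matches the logged reason "tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240". A minimal sketch of the stride difference, assuming CPU tensors as a stand-in for the test's CUDA tensors:

import torch

D = 5120
x = torch.randn(4, 2 * D)        # stand-in for the test's [T, 2*D] CUDA tensor
x0_view = x[:, :D]               # slice: a view with stride (2*D, 1) = (10240, 1)
x0_copy = x0_view.contiguous()   # copy: fresh storage with stride (D, 1) = (5120, 1)
print(x0_view.stride())          # (10240, 1) -> the "actual" in the guard failure
print(x0_copy.stride())          # (5120, 1)  -> the "expected" in the guard failure
# torch.compile specializes on strides, so alternating these two layouts across
# Hypothesis examples recompiles until config.recompile_limit (8) is exhausted.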
2025-05-07T20:32:30.1315452Z self = 2025-05-07T20:32:30.1315999Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:30.1316276Z 2025-05-07T20:32:30.1316366Z @given( 2025-05-07T20:32:30.1316608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.1316936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.1317256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.1317602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.1317930Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.1318221Z ) 2025-05-07T20:32:30.1318582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.1319023Z def test_silu_mul_quant( 2025-05-07T20:32:30.1319276Z self, 2025-05-07T20:32:30.1319479Z T: int, 2025-05-07T20:32:30.1319683Z D: int, 2025-05-07T20:32:30.1319916Z scale_ub: Optional[float], 2025-05-07T20:32:30.1320206Z contiguous: bool, 2025-05-07T20:32:30.1320447Z compiled: bool, 2025-05-07T20:32:30.1320686Z ) -> None: 2025-05-07T20:32:30.1320911Z torch.manual_seed(2025) 2025-05-07T20:32:30.1321155Z 2025-05-07T20:32:30.1321437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.1321889Z 2025-05-07T20:32:30.1322082Z x_sign = torch.sign(x) 2025-05-07T20:32:30.1322388Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.1322707Z x = x_sign * x_clamp 2025-05-07T20:32:30.1322955Z x0 = x[:, :D] 2025-05-07T20:32:30.1323175Z x1 = x[:, D:] 2025-05-07T20:32:30.1323393Z 2025-05-07T20:32:30.1323586Z if contiguous: 2025-05-07T20:32:30.1323820Z x0 = x0.contiguous() 2025-05-07T20:32:30.1324085Z x1 = x1.contiguous() 2025-05-07T20:32:30.1324398Z 2025-05-07T20:32:30.1324599Z if scale_ub is not None: 2025-05-07T20:32:30.1324932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.1325279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.1325587Z ) 2025-05-07T20:32:30.1325790Z else: 2025-05-07T20:32:30.1326011Z scale_ub_tensor = None 2025-05-07T20:32:30.1326263Z 2025-05-07T20:32:30.1326507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1326826Z op = silu_mul_quant 2025-05-07T20:32:30.1327079Z if compiled: 2025-05-07T20:32:30.1327335Z op = torch.compile(op) 2025-05-07T20:32:30.1327634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1327907Z 2025-05-07T20:32:30.1328101Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.1328391Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.1328676Z 2025-05-07T20:32:30.1328918Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1329253Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.1329550Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.1329864Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.1330226Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.1330534Z 2025-05-07T20:32:30.1330819Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:30.1331027Z 2025-05-07T20:32:30.1331130Z moe/activation_test.py:126: 2025-05-07T20:32:30.1331432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1331768Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.1332096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.1332888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.1333643Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.1334197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.1334883Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.1335576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.1336304Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.1337047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:30.1337797Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.1338519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.1339161Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.1339864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.1340391Z fn() 2025-05-07T20:32:30.1340901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.1341528Z self.fn.run( 2025-05-07T20:32:30.1341998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.1342532Z kernel = self.compile( 2025-05-07T20:32:30.1343075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.1343722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.1344127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1344396Z 2025-05-07T20:32:30.1344611Z self = 2025-05-07T20:32:30.1345760Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.1347146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096894a60>} 2025-05-07T20:32:30.1348478Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.1349497Z context = 2025-05-07T20:32:30.1349782Z 2025-05-07T20:32:30.1349957Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.1350474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.1350953Z module_map=module_map) 2025-05-07T20:32:30.1351324Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.1351685Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.1351947Z E ^ 2025-05-07T20:32:30.1352455Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.1352900Z 2025-05-07T20:32:30.1353323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.1353830Z 2025-05-07T20:32:30.1353945Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.1354347Z self=, 2025-05-07T20:32:30.1354752Z T=1, 2025-05-07T20:32:30.1354940Z D=5120, 2025-05-07T20:32:30.1355136Z scale_ub=1200.0, 2025-05-07T20:32:30.1355361Z contiguous=True, 2025-05-07T20:32:30.1355590Z compiled=True, 2025-05-07T20:32:30.1355795Z ) 2025-05-07T20:32:30.2797689Z self = 2025-05-07T20:32:30.2798231Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:30.2798493Z 2025-05-07T20:32:30.2798594Z @given( 2025-05-07T20:32:30.2798830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.2799159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.2799475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.2799813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.2800138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.2800430Z ) 2025-05-07T20:32:30.2800786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.2801225Z def test_silu_mul_quant( 2025-05-07T20:32:30.2801465Z self, 2025-05-07T20:32:30.2801666Z T: int, 2025-05-07T20:32:30.2801867Z D: int, 2025-05-07T20:32:30.2802094Z scale_ub: Optional[float], 2025-05-07T20:32:30.2802379Z contiguous: bool, 2025-05-07T20:32:30.2802624Z compiled: bool, 2025-05-07T20:32:30.2802856Z ) -> None: 2025-05-07T20:32:30.2803081Z torch.manual_seed(2025) 2025-05-07T20:32:30.2803430Z 2025-05-07T20:32:30.2803712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.2804062Z 2025-05-07T20:32:30.2804258Z x_sign = torch.sign(x) 2025-05-07T20:32:30.2804558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.2804879Z x = x_sign * x_clamp 2025-05-07T20:32:30.2805145Z x0 = x[:, :D] 2025-05-07T20:32:30.2805361Z x1 = x[:, D:] 2025-05-07T20:32:30.2805571Z 2025-05-07T20:32:30.2805757Z if contiguous: 2025-05-07T20:32:30.2806056Z x0 = x0.contiguous() 2025-05-07T20:32:30.2806316Z x1 = x1.contiguous() 2025-05-07T20:32:30.2806554Z 2025-05-07T20:32:30.2806797Z if scale_ub is not None: 2025-05-07T20:32:30.2807076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.2807414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.2807715Z ) 2025-05-07T20:32:30.2807916Z else: 2025-05-07T20:32:30.2808138Z scale_ub_tensor = None 2025-05-07T20:32:30.2808383Z 2025-05-07T20:32:30.2808617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.2808930Z op = silu_mul_quant 2025-05-07T20:32:30.2809179Z if compiled: 2025-05-07T20:32:30.2809431Z op = torch.compile(op) 2025-05-07T20:32:30.2809733Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2810008Z 2025-05-07T20:32:30.2810198Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.2810374Z 2025-05-07T20:32:30.2810476Z moe/activation_test.py:117: 2025-05-07T20:32:30.2810782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2811112Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.2811395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2811959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.2812576Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.2813242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.2813934Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.2814474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.2815146Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.2815808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.2816347Z kernel = self.compile( 2025-05-07T20:32:30.2816885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.2817528Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.2817922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2818148Z 2025-05-07T20:32:30.2818369Z self = 2025-05-07T20:32:30.2819442Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.2820858Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09688f1c0>} 2025-05-07T20:32:30.2822189Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.2823223Z context = 2025-05-07T20:32:30.2823557Z 2025-05-07T20:32:30.2823731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.2824236Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.2824692Z module_map=module_map) 2025-05-07T20:32:30.2825057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.2825403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.2825654Z E ^ 2025-05-07T20:32:30.2826118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.2826614Z 2025-05-07T20:32:30.2827070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.2827584Z 2025-05-07T20:32:30.2827690Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.2828100Z self=, 2025-05-07T20:32:30.2828504Z T=1, 2025-05-07T20:32:30.2828688Z D=5120, 2025-05-07T20:32:30.2828876Z scale_ub=None, 2025-05-07T20:32:30.2829089Z contiguous=False, 2025-05-07T20:32:30.2829313Z compiled=True, 2025-05-07T20:32:30.2829512Z ) 2025-05-07T20:32:30.3502311Z self = 2025-05-07T20:32:30.3502986Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.3503249Z 2025-05-07T20:32:30.3503335Z @given( 2025-05-07T20:32:30.3503576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.3503899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.3504215Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.3504539Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.3504872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.3505161Z ) 2025-05-07T20:32:30.3505659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.3506109Z def test_silu_mul_quant( 2025-05-07T20:32:30.3506354Z self, 2025-05-07T20:32:30.3506552Z T: int, 2025-05-07T20:32:30.3506751Z D: int, 2025-05-07T20:32:30.3506975Z scale_ub: Optional[float], 2025-05-07T20:32:30.3507250Z contiguous: bool, 2025-05-07T20:32:30.3507497Z compiled: bool, 2025-05-07T20:32:30.3507731Z ) -> None: 2025-05-07T20:32:30.3507952Z torch.manual_seed(2025) 2025-05-07T20:32:30.3508197Z 2025-05-07T20:32:30.3508472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.3508803Z 2025-05-07T20:32:30.3508992Z x_sign = torch.sign(x) 2025-05-07T20:32:30.3509284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.3509595Z x = x_sign * x_clamp 2025-05-07T20:32:30.3509840Z x0 = x[:, :D] 2025-05-07T20:32:30.3510060Z x1 = x[:, D:] 2025-05-07T20:32:30.3510282Z 2025-05-07T20:32:30.3510471Z if contiguous: 2025-05-07T20:32:30.3510710Z x0 = x0.contiguous() 2025-05-07T20:32:30.3510969Z x1 = x1.contiguous() 2025-05-07T20:32:30.3511203Z 2025-05-07T20:32:30.3511401Z if scale_ub is not None: 2025-05-07T20:32:30.3511681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.3512016Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.3512322Z ) 2025-05-07T20:32:30.3512521Z else: 2025-05-07T20:32:30.3512739Z scale_ub_tensor = None 2025-05-07T20:32:30.3512986Z 2025-05-07T20:32:30.3513229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.3513544Z op = silu_mul_quant 2025-05-07T20:32:30.3513798Z if compiled: 2025-05-07T20:32:30.3514050Z op = torch.compile(op) 2025-05-07T20:32:30.3514353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.3514696Z 2025-05-07T20:32:30.3514900Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.3515194Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.3515481Z 2025-05-07T20:32:30.3515720Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.3516053Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.3516343Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.3516650Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.3517009Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.3517381Z 2025-05-07T20:32:30.3517581Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:30.3517776Z 2025-05-07T20:32:30.3517936Z moe/activation_test.py:126: 2025-05-07T20:32:30.3518237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.3518562Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.3518887Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.3519678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.3520420Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.3520956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.3521632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.3522316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.3523037Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.3523775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:30.3524555Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.3525283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.3525914Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.3526507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.3527018Z fn() 2025-05-07T20:32:30.3527517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.3528085Z self.fn.run( 2025-05-07T20:32:30.3528553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.3529076Z kernel = self.compile( 2025-05-07T20:32:30.3529629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.3530304Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.3530694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.3530916Z 2025-05-07T20:32:30.3531130Z self = 2025-05-07T20:32:30.3532188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.3533550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd09688e680>} 2025-05-07T20:32:30.3534872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.3535976Z context = 2025-05-07T20:32:30.3536257Z 2025-05-07T20:32:30.3536429Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.3536939Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.3537401Z module_map=module_map) 2025-05-07T20:32:30.3537761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.3538110Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.3538418Z E ^ 2025-05-07T20:32:30.3538915Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.3539355Z 2025-05-07T20:32:30.3539894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.3540399Z 2025-05-07T20:32:30.3540508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.3540918Z self=, 2025-05-07T20:32:30.3541495Z T=1, 2025-05-07T20:32:30.3541674Z D=5120, 2025-05-07T20:32:30.3541870Z scale_ub=None, 2025-05-07T20:32:30.3542086Z contiguous=True, 2025-05-07T20:32:30.3542306Z compiled=False, 2025-05-07T20:32:30.3542510Z ) 2025-05-07T20:32:30.6804580Z self = 2025-05-07T20:32:30.6812091Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:30.6812502Z 2025-05-07T20:32:30.6812618Z @given( 2025-05-07T20:32:30.6812963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.6813416Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.6813780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.6814119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.6814578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.6814870Z ) 2025-05-07T20:32:30.6815226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.6815669Z def test_silu_mul_quant( 2025-05-07T20:32:30.6815911Z self, 2025-05-07T20:32:30.6816105Z T: int, 2025-05-07T20:32:30.6816308Z D: int, 2025-05-07T20:32:30.6816531Z scale_ub: Optional[float], 2025-05-07T20:32:30.6816802Z contiguous: bool, 2025-05-07T20:32:30.6817053Z compiled: bool, 2025-05-07T20:32:30.6817283Z ) -> None: 2025-05-07T20:32:30.6817495Z torch.manual_seed(2025) 2025-05-07T20:32:30.6817742Z 2025-05-07T20:32:30.6818027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.6818363Z 2025-05-07T20:32:30.6818557Z x_sign = torch.sign(x) 2025-05-07T20:32:30.6818851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.6819152Z x = x_sign * x_clamp 2025-05-07T20:32:30.6819398Z x0 = x[:, :D] 2025-05-07T20:32:30.6819638Z x1 = x[:, D:] 2025-05-07T20:32:30.6819955Z 2025-05-07T20:32:30.6820146Z if contiguous: 2025-05-07T20:32:30.6820377Z x0 = x0.contiguous() 2025-05-07T20:32:30.6820628Z x1 = x1.contiguous() 2025-05-07T20:32:30.6820868Z 2025-05-07T20:32:30.6821067Z if scale_ub is not None: 2025-05-07T20:32:30.6821339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.6821665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.6821973Z ) 2025-05-07T20:32:30.6822167Z else: 2025-05-07T20:32:30.6822379Z scale_ub_tensor = None 2025-05-07T20:32:30.6822634Z 2025-05-07T20:32:30.6822866Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.6823169Z op = silu_mul_quant 2025-05-07T20:32:30.6823419Z if compiled: 2025-05-07T20:32:30.6823671Z 
op = torch.compile(op) 2025-05-07T20:32:30.6824034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.6824311Z 2025-05-07T20:32:30.6824504Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.6824671Z 2025-05-07T20:32:30.6824775Z moe/activation_test.py:117: 2025-05-07T20:32:30.6825069Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.6825405Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.6825686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.6826368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.6827186Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.6827760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.6828441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.6829099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.6829632Z kernel = self.compile( 2025-05-07T20:32:30.6830172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.6830817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.6831206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.6831440Z 2025-05-07T20:32:30.6831652Z self = 2025-05-07T20:32:30.6832725Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.6834129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbd900>} 2025-05-07T20:32:30.6835482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.6836489Z context = 2025-05-07T20:32:30.6836776Z 2025-05-07T20:32:30.6836940Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.6837463Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.6837922Z module_map=module_map) 2025-05-07T20:32:30.6838290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.6838645Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.6838898Z E ^ 2025-05-07T20:32:30.6839357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.6839799Z 2025-05-07T20:32:30.6840215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.6840723Z 2025-05-07T20:32:30.6840829Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.6841238Z self=, 2025-05-07T20:32:30.6841624Z T=128, 2025-05-07T20:32:30.6841808Z D=5120, 2025-05-07T20:32:30.6842003Z scale_ub=None, 2025-05-07T20:32:30.6842212Z contiguous=False, 2025-05-07T20:32:30.6842434Z compiled=True, 2025-05-07T20:32:30.6842638Z ) 2025-05-07T20:32:30.6842953Z self = 2025-05-07T20:32:30.6843440Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.6843714Z 2025-05-07T20:32:30.6843787Z @given( 2025-05-07T20:32:30.6844066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.6844368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.6844669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.6844995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.6845312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.6845592Z ) 2025-05-07T20:32:30.6845933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.6846367Z def test_silu_mul_quant( 2025-05-07T20:32:30.6846647Z self, 2025-05-07T20:32:30.6846839Z T: int, 2025-05-07T20:32:30.6847035Z D: int, 2025-05-07T20:32:30.6847286Z scale_ub: Optional[float], 2025-05-07T20:32:30.6847560Z contiguous: bool, 2025-05-07T20:32:30.6847794Z compiled: bool, 2025-05-07T20:32:30.6848014Z ) -> None: 2025-05-07T20:32:30.6848229Z torch.manual_seed(2025) 2025-05-07T20:32:30.6848470Z 2025-05-07T20:32:30.6848736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.6849069Z 2025-05-07T20:32:30.6849260Z x_sign = torch.sign(x) 2025-05-07T20:32:30.6849548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.6849899Z x = x_sign * x_clamp 2025-05-07T20:32:30.6850137Z x0 = x[:, :D] 2025-05-07T20:32:30.6850347Z x1 = x[:, D:] 2025-05-07T20:32:30.6850552Z 2025-05-07T20:32:30.6850740Z if contiguous: 2025-05-07T20:32:30.6850965Z x0 = x0.contiguous() 2025-05-07T20:32:30.6851225Z x1 = x1.contiguous() 2025-05-07T20:32:30.6851458Z 2025-05-07T20:32:30.6851652Z if scale_ub is not None: 2025-05-07T20:32:30.6851925Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.6852260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.6852558Z ) 2025-05-07T20:32:30.6852741Z else: 2025-05-07T20:32:30.6852999Z scale_ub_tensor = None 2025-05-07T20:32:30.6853245Z 2025-05-07T20:32:30.6853467Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.6853778Z op = silu_mul_quant 2025-05-07T20:32:30.6854030Z if compiled: 2025-05-07T20:32:30.6854274Z op = torch.compile(op) 2025-05-07T20:32:30.6854569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.6854839Z 2025-05-07T20:32:30.6855025Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.6855193Z 2025-05-07T20:32:30.6855295Z moe/activation_test.py:117: 2025-05-07T20:32:30.6855586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.6855920Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.6856191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.6856741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.6857292Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.6857944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.6858623Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.6859153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.6859924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.6860579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.6861109Z kernel = self.compile( 2025-05-07T20:32:30.6861650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.6862288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.6862678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.6862965Z 2025-05-07T20:32:30.6863173Z self = 2025-05-07T20:32:30.6864236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.6865602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbfeb0>} 2025-05-07T20:32:30.6867014Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.6868026Z context = 2025-05-07T20:32:30.6868320Z 2025-05-07T20:32:30.6868487Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.6869004Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.6869457Z module_map=module_map) 2025-05-07T20:32:30.6869872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.6870218Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.6870466Z E ^ 2025-05-07T20:32:30.6870924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.6871375Z 2025-05-07T20:32:30.6871788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.6872305Z 2025-05-07T20:32:30.6872414Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.6872815Z self=, 2025-05-07T20:32:30.6873212Z T=128, 2025-05-07T20:32:30.6873438Z D=7168, 2025-05-07T20:32:30.6873627Z scale_ub=1200.0, 2025-05-07T20:32:30.6873847Z contiguous=False, 2025-05-07T20:32:30.6874068Z compiled=False, 2025-05-07T20:32:30.6874262Z ) 2025-05-07T20:32:30.8126182Z self = 2025-05-07T20:32:30.8127310Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:30.8127885Z 2025-05-07T20:32:30.8128047Z @given( 2025-05-07T20:32:30.8128522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.8129032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.8129488Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.8129816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.8130149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.8130431Z ) 2025-05-07T20:32:30.8130785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.8131228Z def test_silu_mul_quant( 2025-05-07T20:32:30.8131469Z self, 2025-05-07T20:32:30.8131669Z T: int, 2025-05-07T20:32:30.8131875Z D: int, 2025-05-07T20:32:30.8132092Z scale_ub: Optional[float], 2025-05-07T20:32:30.8132365Z contiguous: bool, 2025-05-07T20:32:30.8132610Z compiled: bool, 2025-05-07T20:32:30.8132836Z ) -> None: 2025-05-07T20:32:30.8133059Z torch.manual_seed(2025) 2025-05-07T20:32:30.8133294Z 2025-05-07T20:32:30.8133572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.8133903Z 2025-05-07T20:32:30.8134106Z x_sign = torch.sign(x) 2025-05-07T20:32:30.8134403Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.8134716Z x = x_sign * x_clamp 2025-05-07T20:32:30.8134962Z x0 = x[:, :D] 2025-05-07T20:32:30.8135181Z x1 = x[:, D:] 2025-05-07T20:32:30.8135510Z 2025-05-07T20:32:30.8135696Z if contiguous: 2025-05-07T20:32:30.8135931Z x0 = x0.contiguous() 2025-05-07T20:32:30.8136194Z x1 = x1.contiguous() 2025-05-07T20:32:30.8136427Z 2025-05-07T20:32:30.8136620Z if scale_ub is not None: 2025-05-07T20:32:30.8136891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.8137222Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.8137532Z ) 2025-05-07T20:32:30.8137728Z else: 2025-05-07T20:32:30.8137934Z scale_ub_tensor = None 2025-05-07T20:32:30.8138288Z 2025-05-07T20:32:30.8138524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.8138893Z op = silu_mul_quant 2025-05-07T20:32:30.8139147Z if compiled: 2025-05-07T20:32:30.8139400Z op = torch.compile(op) 2025-05-07T20:32:30.8139694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8140037Z 2025-05-07T20:32:30.8140244Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.8140413Z 2025-05-07T20:32:30.8140532Z moe/activation_test.py:117: 2025-05-07T20:32:30.8140828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8141167Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.8141455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8142143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.8142839Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.8143374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.8144056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.8144712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.8145239Z kernel = self.compile( 2025-05-07T20:32:30.8145844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.8146502Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.8146899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8147130Z 2025-05-07T20:32:30.8147346Z self = 2025-05-07T20:32:30.8148422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.8149793Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbd7e0>} 2025-05-07T20:32:30.8151140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.8152156Z context = 2025-05-07T20:32:30.8152448Z 2025-05-07T20:32:30.8152616Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.8153144Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.8153612Z module_map=module_map) 2025-05-07T20:32:30.8153975Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.8154331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.8154582Z E ^ 2025-05-07T20:32:30.8155046Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.8155539Z 2025-05-07T20:32:30.8155954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.8156460Z 2025-05-07T20:32:30.8156571Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.8156978Z self=, 2025-05-07T20:32:30.8157376Z T=128, 2025-05-07T20:32:30.8157563Z D=5120, 2025-05-07T20:32:30.8157752Z scale_ub=None, 2025-05-07T20:32:30.8157975Z contiguous=False, 2025-05-07T20:32:30.8158204Z compiled=False, 2025-05-07T20:32:30.8158451Z ) 2025-05-07T20:32:30.8158773Z self = 2025-05-07T20:32:30.8159325Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:30.8159596Z 2025-05-07T20:32:30.8159681Z @given( 2025-05-07T20:32:30.8159912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.8160230Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.8160543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.8160868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.8161194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.8161474Z ) 2025-05-07T20:32:30.8161811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.8162245Z def test_silu_mul_quant( 2025-05-07T20:32:30.8162484Z self, 2025-05-07T20:32:30.8162672Z T: int, 2025-05-07T20:32:30.8162873Z D: int, 2025-05-07T20:32:30.8163089Z scale_ub: Optional[float], 2025-05-07T20:32:30.8163360Z contiguous: bool, 2025-05-07T20:32:30.8163606Z compiled: bool, 2025-05-07T20:32:30.8163830Z ) -> None: 2025-05-07T20:32:30.8164045Z torch.manual_seed(2025) 2025-05-07T20:32:30.8164279Z 2025-05-07T20:32:30.8164546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.8164883Z 2025-05-07T20:32:30.8165122Z x_sign = torch.sign(x) 2025-05-07T20:32:30.8165412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.8165721Z x = x_sign * x_clamp 2025-05-07T20:32:30.8165955Z x0 = x[:, :D] 2025-05-07T20:32:30.8166175Z x1 = x[:, D:] 2025-05-07T20:32:30.8166379Z 2025-05-07T20:32:30.8166556Z if contiguous: 2025-05-07T20:32:30.8166782Z x0 = x0.contiguous() 2025-05-07T20:32:30.8167036Z x1 = x1.contiguous() 2025-05-07T20:32:30.8167264Z 2025-05-07T20:32:30.8167457Z if scale_ub is not None: 2025-05-07T20:32:30.8167726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.8168058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.8168362Z ) 2025-05-07T20:32:30.8168552Z else: 2025-05-07T20:32:30.8168763Z scale_ub_tensor = None 2025-05-07T20:32:30.8169011Z 2025-05-07T20:32:30.8169239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.8169559Z op = silu_mul_quant 2025-05-07T20:32:30.8169807Z if compiled: 2025-05-07T20:32:30.8170056Z op = torch.compile(op) 2025-05-07T20:32:30.8170347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8170611Z 2025-05-07T20:32:30.8170802Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.8170983Z 2025-05-07T20:32:30.8171082Z moe/activation_test.py:117: 2025-05-07T20:32:30.8171370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8171695Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.8171973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.8172655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.8173331Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.8173866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.8174598Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.8175252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.8175778Z kernel = self.compile( 2025-05-07T20:32:30.8176311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.8176953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.8177387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8177645Z 2025-05-07T20:32:30.8177855Z self = 2025-05-07T20:32:30.8178920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.8180358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09696beb0>} 2025-05-07T20:32:30.8181685Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.8182693Z context = 2025-05-07T20:32:30.8182982Z 2025-05-07T20:32:30.8183151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.8183673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.8184136Z module_map=module_map) 2025-05-07T20:32:30.8184543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.8184899Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.8185159Z E ^ 2025-05-07T20:32:30.8185613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.8186056Z 2025-05-07T20:32:30.8186467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.8186977Z 2025-05-07T20:32:30.8187082Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.8187495Z self=, 2025-05-07T20:32:30.8187887Z T=128, 2025-05-07T20:32:30.8188077Z D=5120, 2025-05-07T20:32:30.8188271Z scale_ub=1200.0, 2025-05-07T20:32:30.8188491Z contiguous=True, 2025-05-07T20:32:30.8188714Z compiled=False, 2025-05-07T20:32:30.8188912Z ) 2025-05-07T20:32:31.0110889Z self = 2025-05-07T20:32:31.0111668Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:31.0112046Z 2025-05-07T20:32:31.0112167Z @given( 2025-05-07T20:32:31.0112455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.0112777Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.0113090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.0113420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.0113756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.0114044Z ) 2025-05-07T20:32:31.0114389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.0114830Z def test_silu_mul_quant( 2025-05-07T20:32:31.0115067Z self, 2025-05-07T20:32:31.0115256Z T: int, 2025-05-07T20:32:31.0115459Z D: int, 2025-05-07T20:32:31.0115688Z scale_ub: Optional[float], 2025-05-07T20:32:31.0116070Z contiguous: bool, 2025-05-07T20:32:31.0116309Z compiled: bool, 2025-05-07T20:32:31.0116536Z ) -> None: 2025-05-07T20:32:31.0116758Z torch.manual_seed(2025) 2025-05-07T20:32:31.0116990Z 2025-05-07T20:32:31.0117271Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.0117611Z 2025-05-07T20:32:31.0117801Z x_sign = torch.sign(x) 2025-05-07T20:32:31.0118095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.0118403Z x = x_sign * x_clamp 2025-05-07T20:32:31.0118710Z x0 = x[:, :D] 2025-05-07T20:32:31.0118931Z x1 = x[:, D:] 2025-05-07T20:32:31.0119140Z 2025-05-07T20:32:31.0119378Z if contiguous: 2025-05-07T20:32:31.0119617Z x0 = x0.contiguous() 2025-05-07T20:32:31.0119877Z x1 = x1.contiguous() 2025-05-07T20:32:31.0120109Z 2025-05-07T20:32:31.0120308Z if scale_ub is not None: 2025-05-07T20:32:31.0120587Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.0120934Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.0121242Z ) 2025-05-07T20:32:31.0121439Z else: 2025-05-07T20:32:31.0121659Z scale_ub_tensor = None 2025-05-07T20:32:31.0121903Z 2025-05-07T20:32:31.0122144Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.0122464Z op = silu_mul_quant 2025-05-07T20:32:31.0122716Z if compiled: 2025-05-07T20:32:31.0122971Z op = torch.compile(op) 2025-05-07T20:32:31.0123271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.0123538Z 2025-05-07T20:32:31.0123741Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.0123913Z 2025-05-07T20:32:31.0124027Z moe/activation_test.py:117: 2025-05-07T20:32:31.0124315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.0124649Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.0124944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.0125698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.0126391Z 
Trying example: test_silu_mul_quant(
self=,
T=1,
D=7168,
scale_ub=None,
contiguous=False,
compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

Here fn() returned, and the failure surfaced in the reference path instead (test body otherwise identical to the listing above):

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:31.4201245Z 2025-05-07T20:32:31.4201345Z moe/activation_test.py:126: 2025-05-07T20:32:31.4201643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.4201966Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:31.4202353Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.4203144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:31.4203885Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:31.4204427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.4205104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.4205782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:31.4206495Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.4207240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:31.4207983Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.4208708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:31.4209333Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:31.4209936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:31.4210450Z fn() 2025-05-07T20:32:31.4210946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:31.4211524Z self.fn.run( 2025-05-07T20:32:31.4211993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.4212522Z kernel = self.compile( 2025-05-07T20:32:31.4213055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.4213761Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.4214158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.4214382Z 2025-05-07T20:32:31.4214590Z self = 2025-05-07T20:32:31.4215661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.4217110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd0965a2ef0>} 2025-05-07T20:32:31.4218433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.4219458Z context = 2025-05-07T20:32:31.4219740Z 2025-05-07T20:32:31.4219978Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.4220496Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.4220961Z module_map=module_map) 2025-05-07T20:32:31.4221327Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.4221680Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:31.4221946Z E ^ 2025-05-07T20:32:31.4222413Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.4222856Z 2025-05-07T20:32:31.4223272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
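So the reference path dies the same way: _kernel_quantize_fp8_row also emits fp8e4nv. For orientation, a rough eager-mode equivalent of the row-wise fp8 quantization being exercised (a sketch only; the function name and clamping details are assumptions, not FBGEMM's triton_quantize_fp8_row):

    from typing import Optional, Tuple

    import torch

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row by its max-abs so it fits float8_e4m3fn (max ~448).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        row_max = torch.clamp(row_max, min=1e-12)  # guard against all-zero rows
        scale = row_max / fp8_max  # per-row dequantization scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This matches how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]), which is why both the fused kernel and the reference must compile on the same device before any outputs can be compared.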
2025-05-07T20:32:31.5912996Z op = torch.compile(op) 2025-05-07T20:32:31.5913290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.5913564Z 2025-05-07T20:32:31.5913760Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.5913935Z 2025-05-07T20:32:31.5914044Z moe/activation_test.py:117: 2025-05-07T20:32:31.5914346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5914681Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.5914964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.5915526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:31.5916090Z return fn(*args, **kwargs) 2025-05-07T20:32:31.5916747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.5917432Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.5918036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.5918723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.5919378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.5919915Z kernel = self.compile( 2025-05-07T20:32:31.5920461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.5921114Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.5921509Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5921746Z 2025-05-07T20:32:31.5921955Z self = 2025-05-07T20:32:31.5923030Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.5924404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0965a3eb0>} 2025-05-07T20:32:31.5925731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.5926761Z context = 2025-05-07T20:32:31.5927053Z 2025-05-07T20:32:31.5927227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.5933209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.5933737Z module_map=module_map) 2025-05-07T20:32:31.5934195Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.5934556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.5934819Z E ^ 2025-05-07T20:32:31.5935298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.5935755Z 2025-05-07T20:32:31.5936179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.5936700Z 2025-05-07T20:32:31.5936860Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.5937281Z self=, 2025-05-07T20:32:31.5937727Z T=1, 2025-05-07T20:32:31.5937926Z D=5120, 2025-05-07T20:32:31.5938124Z scale_ub=1200.0, 2025-05-07T20:32:31.5938352Z contiguous=False, 2025-05-07T20:32:31.5938585Z compiled=False, 2025-05-07T20:32:31.5938789Z ) 2025-05-07T20:32:31.5939116Z self = 2025-05-07T20:32:31.5939613Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:31.5939943Z 2025-05-07T20:32:31.5940035Z @given( 2025-05-07T20:32:31.5940270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.5940591Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.5940909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.5941238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.5941579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.5941869Z ) 2025-05-07T20:32:31.5942225Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.5942671Z def test_silu_mul_quant( 2025-05-07T20:32:31.5942917Z self, 2025-05-07T20:32:31.5943119Z T: int, 2025-05-07T20:32:31.5943322Z D: int, 2025-05-07T20:32:31.5943555Z scale_ub: Optional[float], 2025-05-07T20:32:31.5943877Z contiguous: bool, 2025-05-07T20:32:31.5944128Z compiled: bool, 2025-05-07T20:32:31.5944362Z ) -> None: 2025-05-07T20:32:31.5944585Z torch.manual_seed(2025) 2025-05-07T20:32:31.5944834Z 2025-05-07T20:32:31.5945107Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.5945455Z 2025-05-07T20:32:31.5945650Z x_sign = torch.sign(x) 2025-05-07T20:32:31.5945946Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.5946257Z x = x_sign * x_clamp 2025-05-07T20:32:31.5946504Z x0 = x[:, :D] 2025-05-07T20:32:31.5946725Z x1 = x[:, D:] 2025-05-07T20:32:31.5946940Z 2025-05-07T20:32:31.5947127Z if contiguous: 2025-05-07T20:32:31.5947364Z x0 = x0.contiguous() 2025-05-07T20:32:31.5947624Z x1 = x1.contiguous() 2025-05-07T20:32:31.5947866Z 2025-05-07T20:32:31.5948064Z if scale_ub is not None: 2025-05-07T20:32:31.5948352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.5948693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.5949003Z ) 2025-05-07T20:32:31.5949204Z else: 2025-05-07T20:32:31.5949421Z scale_ub_tensor = None 2025-05-07T20:32:31.5949675Z 2025-05-07T20:32:31.5949910Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.5950226Z op = silu_mul_quant 2025-05-07T20:32:31.5950483Z if compiled: 2025-05-07T20:32:31.5950738Z op = torch.compile(op) 2025-05-07T20:32:31.5951043Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.5951317Z 2025-05-07T20:32:31.5951520Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.5951684Z 2025-05-07T20:32:31.5951792Z moe/activation_test.py:117: 2025-05-07T20:32:31.5952084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5952419Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.5952769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.5953476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.5954166Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.5954711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.5955392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.5956110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.5956683Z kernel = self.compile( 2025-05-07T20:32:31.5957234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.5957889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.5958295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5958522Z 2025-05-07T20:32:31.5958731Z self = 2025-05-07T20:32:31.5959823Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.5961213Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09696be20>} 2025-05-07T20:32:31.5962556Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.5963572Z context = 2025-05-07T20:32:31.5963904Z 2025-05-07T20:32:31.5964085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.5964622Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.5965089Z module_map=module_map) 2025-05-07T20:32:31.5965459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.5965814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.5966072Z E ^ 2025-05-07T20:32:31.5966538Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.5966984Z 2025-05-07T20:32:31.5967411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.5967922Z 2025-05-07T20:32:31.5968030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.5968445Z self=, 2025-05-07T20:32:31.5968848Z T=16384, 2025-05-07T20:32:31.5969042Z D=5120, 2025-05-07T20:32:31.5969238Z scale_ub=1200.0, 2025-05-07T20:32:31.5969466Z contiguous=False, 2025-05-07T20:32:31.5969698Z compiled=True, 2025-05-07T20:32:31.5969926Z ) 2025-05-07T20:32:31.6964154Z self = 2025-05-07T20:32:31.6965307Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.6965738Z 2025-05-07T20:32:31.6965861Z @given( 2025-05-07T20:32:31.6966220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.6966688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.6967147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.6967639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.6968127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.6968695Z ) 2025-05-07T20:32:31.6969215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.6969862Z def test_silu_mul_quant( 2025-05-07T20:32:31.6970097Z self, 2025-05-07T20:32:31.6970301Z T: int, 2025-05-07T20:32:31.6970503Z D: int, 2025-05-07T20:32:31.6970721Z scale_ub: Optional[float], 2025-05-07T20:32:31.6971002Z contiguous: bool, 2025-05-07T20:32:31.6971245Z compiled: bool, 2025-05-07T20:32:31.6971466Z ) -> None: 2025-05-07T20:32:31.6971679Z torch.manual_seed(2025) 2025-05-07T20:32:31.6971993Z 2025-05-07T20:32:31.6972265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.6972667Z 2025-05-07T20:32:31.6972864Z x_sign = torch.sign(x) 2025-05-07T20:32:31.6973150Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.6973471Z x = x_sign * x_clamp 2025-05-07T20:32:31.6973717Z x0 = x[:, :D] 2025-05-07T20:32:31.6973943Z x1 = x[:, D:] 2025-05-07T20:32:31.6974150Z 2025-05-07T20:32:31.6974339Z if contiguous: 2025-05-07T20:32:31.6974574Z x0 = x0.contiguous() 2025-05-07T20:32:31.6974840Z x1 = x1.contiguous() 2025-05-07T20:32:31.6975080Z 2025-05-07T20:32:31.6975273Z if scale_ub is not None: 2025-05-07T20:32:31.6975543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.6975874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.6976185Z ) 2025-05-07T20:32:31.6976371Z else: 2025-05-07T20:32:31.6976592Z scale_ub_tensor = None 2025-05-07T20:32:31.6976838Z 2025-05-07T20:32:31.6977071Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.6977386Z op = silu_mul_quant 2025-05-07T20:32:31.6977645Z if compiled: 2025-05-07T20:32:31.6977896Z op = torch.compile(op) 2025-05-07T20:32:31.6978189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.6978470Z 2025-05-07T20:32:31.6978731Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.6978903Z 2025-05-07T20:32:31.6979006Z moe/activation_test.py:117: 2025-05-07T20:32:31.6979312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.6979645Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.6979986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.6980548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:31.6981110Z return fn(*args, **kwargs) 
2025-05-07T20:32:31.6981762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.6982442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.6982975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.6983662Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.6984320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.6984844Z kernel = self.compile( 2025-05-07T20:32:31.6985383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.6986031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.6986418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.6986651Z 2025-05-07T20:32:31.6986861Z self = 2025-05-07T20:32:31.6987925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.6989343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1a0c8b0>} 2025-05-07T20:32:31.6990836Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.6991866Z context = 2025-05-07T20:32:31.6992228Z 2025-05-07T20:32:31.6992397Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.6992992Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.6993448Z module_map=module_map) 2025-05-07T20:32:31.6993812Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.6994166Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.6994421Z E ^ 2025-05-07T20:32:31.6994888Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.6995335Z 2025-05-07T20:32:31.6995751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.6996259Z 2025-05-07T20:32:31.6996369Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.6996776Z self=, 2025-05-07T20:32:31.6997181Z T=2048, 2025-05-07T20:32:31.6997372Z D=7168, 2025-05-07T20:32:31.6997560Z scale_ub=1200.0, 2025-05-07T20:32:31.6997788Z contiguous=False, 2025-05-07T20:32:31.6998017Z compiled=True, 2025-05-07T20:32:31.6998216Z ) 2025-05-07T20:32:31.6998528Z self = 2025-05-07T20:32:31.6999083Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.6999356Z 2025-05-07T20:32:31.6999439Z @given( 2025-05-07T20:32:31.6999667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.6999979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.7000288Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.7000610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.7000944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.7001235Z ) 2025-05-07T20:32:31.7001590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.7002025Z def test_silu_mul_quant( 2025-05-07T20:32:31.7002272Z self, 2025-05-07T20:32:31.7002466Z T: int, 2025-05-07T20:32:31.7002661Z D: int, 2025-05-07T20:32:31.7002882Z scale_ub: Optional[float], 2025-05-07T20:32:31.7003154Z contiguous: bool, 2025-05-07T20:32:31.7003386Z compiled: bool, 2025-05-07T20:32:31.7003614Z ) -> None: 2025-05-07T20:32:31.7003833Z torch.manual_seed(2025) 2025-05-07T20:32:31.7004072Z 2025-05-07T20:32:31.7004342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.7004685Z 2025-05-07T20:32:31.7004878Z x_sign = torch.sign(x) 2025-05-07T20:32:31.7005165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.7005474Z x = x_sign * x_clamp 2025-05-07T20:32:31.7005716Z x0 = x[:, :D] 2025-05-07T20:32:31.7005930Z x1 = x[:, D:] 2025-05-07T20:32:31.7006139Z 2025-05-07T20:32:31.7006324Z if contiguous: 2025-05-07T20:32:31.7006554Z x0 = x0.contiguous() 2025-05-07T20:32:31.7006818Z x1 = x1.contiguous() 2025-05-07T20:32:31.7007063Z 2025-05-07T20:32:31.7007253Z if scale_ub is not None: 2025-05-07T20:32:31.7007529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.7007872Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.7008244Z ) 2025-05-07T20:32:31.7008437Z else: 2025-05-07T20:32:31.7008648Z scale_ub_tensor = None 2025-05-07T20:32:31.7008895Z 2025-05-07T20:32:31.7009125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.7009441Z op = silu_mul_quant 2025-05-07T20:32:31.7009692Z if compiled: 2025-05-07T20:32:31.7009941Z op = torch.compile(op) 2025-05-07T20:32:31.7010241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.7010554Z 2025-05-07T20:32:31.7010746Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.7010914Z 2025-05-07T20:32:31.7011013Z moe/activation_test.py:117: 2025-05-07T20:32:31.7011342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.7011674Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.7011955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.7012511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:31.7013082Z return fn(*args, **kwargs) 
2025-05-07T20:32:31.7013735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.7014415Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.7014956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.7015628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.7016298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.7016829Z kernel = self.compile( 2025-05-07T20:32:31.7017370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.7018061Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.7018466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.7018690Z 2025-05-07T20:32:31.7018901Z self = 2025-05-07T20:32:31.7020040Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.7021398Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1a0d090>} 2025-05-07T20:32:31.7022731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.7023767Z context = 2025-05-07T20:32:31.7024054Z 2025-05-07T20:32:31.7024230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.7024746Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.7025219Z module_map=module_map) 2025-05-07T20:32:31.7025588Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.7025944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.7026199Z E ^ 2025-05-07T20:32:31.7026659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.7027102Z 2025-05-07T20:32:31.7027519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.7028023Z 2025-05-07T20:32:31.8317441Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8318068Z self=, 2025-05-07T20:32:31.8318479Z T=1, 2025-05-07T20:32:31.8318666Z D=5120, 2025-05-07T20:32:31.8318859Z scale_ub=None, 2025-05-07T20:32:31.8319073Z contiguous=False, 2025-05-07T20:32:31.8319308Z compiled=False, 2025-05-07T20:32:31.8319520Z ) 2025-05-07T20:32:31.8319834Z self = 2025-05-07T20:32:31.8320321Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:31.8320661Z 2025-05-07T20:32:31.8320744Z @given( 2025-05-07T20:32:31.8321030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.8321339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.8321648Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.8321977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.8322302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.8322593Z ) 2025-05-07T20:32:31.8322938Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.8323381Z def test_silu_mul_quant( 2025-05-07T20:32:31.8323621Z self, 2025-05-07T20:32:31.8323818Z T: int, 2025-05-07T20:32:31.8324017Z D: int, 2025-05-07T20:32:31.8324241Z scale_ub: Optional[float], 2025-05-07T20:32:31.8324515Z contiguous: bool, 2025-05-07T20:32:31.8324753Z compiled: bool, 2025-05-07T20:32:31.8324971Z ) -> None: 2025-05-07T20:32:31.8325187Z torch.manual_seed(2025) 2025-05-07T20:32:31.8325422Z 2025-05-07T20:32:31.8325701Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.8326043Z 2025-05-07T20:32:31.8326237Z x_sign = torch.sign(x) 2025-05-07T20:32:31.8326527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.8326839Z x = x_sign * x_clamp 2025-05-07T20:32:31.8327150Z x0 = x[:, :D] 2025-05-07T20:32:31.8327371Z x1 = x[:, D:] 2025-05-07T20:32:31.8327574Z 2025-05-07T20:32:31.8327752Z if contiguous: 2025-05-07T20:32:31.8327976Z x0 = x0.contiguous() 2025-05-07T20:32:31.8328233Z x1 = x1.contiguous() 2025-05-07T20:32:31.8328472Z 2025-05-07T20:32:31.8328658Z if scale_ub is not None: 2025-05-07T20:32:31.8328936Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.8329271Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.8329578Z ) 2025-05-07T20:32:31.8329772Z else: 2025-05-07T20:32:31.8330006Z scale_ub_tensor = None 2025-05-07T20:32:31.8330279Z 2025-05-07T20:32:31.8330511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8330816Z op = silu_mul_quant 2025-05-07T20:32:31.8331060Z if compiled: 2025-05-07T20:32:31.8331299Z op = torch.compile(op) 2025-05-07T20:32:31.8331602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8331872Z 2025-05-07T20:32:31.8332056Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.8332221Z 2025-05-07T20:32:31.8332325Z moe/activation_test.py:117: 2025-05-07T20:32:31.8332626Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8332950Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.8333232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8333927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.8334630Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.8335168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.8335853Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.8336606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.8337138Z kernel = self.compile( 2025-05-07T20:32:31.8337681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.8338336Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.8338736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8338962Z 2025-05-07T20:32:31.8339173Z self = 2025-05-07T20:32:31.8340407Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.8341804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1a0d7e0>} 2025-05-07T20:32:31.8343157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.8344181Z context = 2025-05-07T20:32:31.8344465Z 2025-05-07T20:32:31.8344630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.8345157Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.8345628Z module_map=module_map) 2025-05-07T20:32:31.8345986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8346344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.8346606Z E ^ 2025-05-07T20:32:31.8347120Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8347574Z 2025-05-07T20:32:31.8347987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.8348505Z 2025-05-07T20:32:31.8348607Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8349014Z self=, 2025-05-07T20:32:31.8349398Z T=4096, 2025-05-07T20:32:31.8349580Z D=7168, 2025-05-07T20:32:31.8349774Z scale_ub=1200.0, 2025-05-07T20:32:31.8349992Z contiguous=False, 2025-05-07T20:32:31.8350217Z compiled=False, 2025-05-07T20:32:31.8350420Z ) 2025-05-07T20:32:31.8350734Z self = 2025-05-07T20:32:31.8351212Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:31.8351486Z 2025-05-07T20:32:31.8351559Z @given( 2025-05-07T20:32:31.8351789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.8352090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.8352395Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.8352715Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.8353030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.8353336Z ) 2025-05-07T20:32:31.8353680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.8354115Z def test_silu_mul_quant( 2025-05-07T20:32:31.8354357Z self, 2025-05-07T20:32:31.8354545Z T: int, 2025-05-07T20:32:31.8354733Z D: int, 2025-05-07T20:32:31.8354949Z scale_ub: Optional[float], 2025-05-07T20:32:31.8355214Z contiguous: bool, 2025-05-07T20:32:31.8355452Z compiled: bool, 2025-05-07T20:32:31.8355673Z ) -> None: 2025-05-07T20:32:31.8355883Z torch.manual_seed(2025) 2025-05-07T20:32:31.8356167Z 2025-05-07T20:32:31.8356445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.8356778Z 2025-05-07T20:32:31.8356970Z x_sign = torch.sign(x) 2025-05-07T20:32:31.8357258Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.8357558Z x = x_sign * x_clamp 2025-05-07T20:32:31.8357795Z x0 = x[:, :D] 2025-05-07T20:32:31.8358006Z x1 = x[:, D:] 2025-05-07T20:32:31.8358202Z 2025-05-07T20:32:31.8358384Z if contiguous: 2025-05-07T20:32:31.8358617Z x0 = x0.contiguous() 2025-05-07T20:32:31.8358913Z x1 = x1.contiguous() 2025-05-07T20:32:31.8359152Z 2025-05-07T20:32:31.8359385Z if scale_ub is not None: 2025-05-07T20:32:31.8359657Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.8360017Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.8360334Z ) 2025-05-07T20:32:31.8366656Z else: 2025-05-07T20:32:31.8366889Z scale_ub_tensor = None 2025-05-07T20:32:31.8367156Z 2025-05-07T20:32:31.8367404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8367727Z op = silu_mul_quant 2025-05-07T20:32:31.8367980Z if compiled: 2025-05-07T20:32:31.8368233Z op = torch.compile(op) 2025-05-07T20:32:31.8368531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8368799Z 2025-05-07T20:32:31.8368994Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.8369161Z 2025-05-07T20:32:31.8369269Z moe/activation_test.py:117: 2025-05-07T20:32:31.8369567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8369907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.8370193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8370887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:31.8371648Z [tail of the preceding example's traceback elided; it is identical to the one shown in full below]
2025-05-07T20:32:31.8385542Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:31.8385986Z     self=,
2025-05-07T20:32:31.8386392Z     T=16384,
2025-05-07T20:32:31.8386585Z     D=7168,
2025-05-07T20:32:31.8386779Z     scale_ub=None,
2025-05-07T20:32:31.8386987Z     contiguous=True,
2025-05-07T20:32:31.8387213Z     compiled=True,
2025-05-07T20:32:31.8387417Z )
2025-05-07T20:32:32.0317919Z self = 
2025-05-07T20:32:32.0318512Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:32.0318870Z     @given(
2025-05-07T20:32:32.0319210Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:32.0319555Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:32.0319859Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:32.0320205Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:32.0320537Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:32.0320815Z     )
2025-05-07T20:32:32.0321168Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:32.0321613Z     def test_silu_mul_quant(
2025-05-07T20:32:32.0321848Z         self,
2025-05-07T20:32:32.0322047Z         T: int,
2025-05-07T20:32:32.0322252Z         D: int,
2025-05-07T20:32:32.0322601Z         scale_ub: Optional[float],
2025-05-07T20:32:32.0322875Z         contiguous: bool,
2025-05-07T20:32:32.0323120Z         compiled: bool,
2025-05-07T20:32:32.0323357Z     ) -> None:
2025-05-07T20:32:32.0323575Z         torch.manual_seed(2025)
2025-05-07T20:32:32.0324107Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:32.0324644Z         x_sign = torch.sign(x)
2025-05-07T20:32:32.0324941Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:32.0325245Z         x = x_sign * x_clamp
2025-05-07T20:32:32.0325486Z         x0 = x[:, :D]
2025-05-07T20:32:32.0325708Z         x1 = x[:, D:]
2025-05-07T20:32:32.0326105Z         if contiguous:
2025-05-07T20:32:32.0326339Z             x0 = x0.contiguous()
2025-05-07T20:32:32.0326605Z             x1 = x1.contiguous()
2025-05-07T20:32:32.0327057Z         if scale_ub is not None:
2025-05-07T20:32:32.0327330Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:32.0327655Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:32.0327964Z             )
2025-05-07T20:32:32.0328153Z         else:
2025-05-07T20:32:32.0328366Z             scale_ub_tensor = None
2025-05-07T20:32:32.0328854Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:32.0329163Z             op = silu_mul_quant
2025-05-07T20:32:32.0329413Z             if compiled:
2025-05-07T20:32:32.0329685Z                 op = torch.compile(op)
2025-05-07T20:32:32.0330072Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:32.0330545Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:32.0330815Z moe/activation_test.py:117: 
2025-05-07T20:32:32.0331128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:32.0331560Z moe/activation_test.py:115: in fn
2025-05-07T20:32:32.0331854Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:32.0332414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:32.0332974Z     return fn(*args, **kwargs)
2025-05-07T20:32:32.0333633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:32.0334316Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:32.0334923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:32.0335663Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:32.0336324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:32.0336849Z     kernel = self.compile(
2025-05-07T20:32:32.0337399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:32.0338054Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:32.0338450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:32.0338883Z self = 
2025-05-07T20:32:32.0340061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:32.0341561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1a0f760>}
2025-05-07T20:32:32.0342968Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:32.0343993Z context = 
2025-05-07T20:32:32.0344456Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:32.0344973Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:32.0345448Z                            module_map=module_map)
2025-05-07T20:32:32.0345812Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:32.0346173Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:32.0346443Z E       ^
2025-05-07T20:32:32.0346909Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.0347792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
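For context, silu_mul_quant fuses SiLU(x0) * x1 with quantization to fp8, returning the quantized tensor and a dequantization scale. The kernel source is not part of this log, so the following eager-mode reference is only a sketch of what the test appears to exercise, assuming rowwise float8_e4m3fn quantization with an optional scale upper bound; silu_mul_quant_ref and its scaling scheme are illustrative, not FBGEMM's actual implementation:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32 for accuracy, then rowwise fp8 quantization.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the quantization range
        y_scale = row_max / FP8_MAX                     # per-row dequant scale
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)   # "fp8e4nv" in Triton naming
        return y_fp8, y_scale

The fp8e4nv named in the error is Triton's name for this float8_e4m3fn target dtype; the kernel fails at compile time because Triton cannot emit that dtype on this GPU.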
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.0347368Z 2025-05-07T20:32:32.0347792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.0348317Z 2025-05-07T20:32:32.0348426Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.0348844Z self=, 2025-05-07T20:32:32.0349236Z T=4096, 2025-05-07T20:32:32.0349434Z D=5120, 2025-05-07T20:32:32.0349634Z scale_ub=None, 2025-05-07T20:32:32.0349855Z contiguous=False, 2025-05-07T20:32:32.0350075Z compiled=True, 2025-05-07T20:32:32.0350280Z ) 2025-05-07T20:32:32.0350598Z self = 2025-05-07T20:32:32.0351105Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.0351477Z 2025-05-07T20:32:32.0351562Z @given( 2025-05-07T20:32:32.0351792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.0352105Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.0352473Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.0352810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.0353138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.0353423Z ) 2025-05-07T20:32:32.0353780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.0354224Z def test_silu_mul_quant( 2025-05-07T20:32:32.0354467Z self, 2025-05-07T20:32:32.0354667Z T: int, 2025-05-07T20:32:32.0354864Z D: int, 2025-05-07T20:32:32.0355144Z scale_ub: Optional[float], 2025-05-07T20:32:32.0355414Z contiguous: bool, 2025-05-07T20:32:32.0355697Z compiled: bool, 2025-05-07T20:32:32.0355921Z ) -> None: 2025-05-07T20:32:32.0356146Z torch.manual_seed(2025) 2025-05-07T20:32:32.0356390Z 2025-05-07T20:32:32.0356665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.0357007Z 2025-05-07T20:32:32.0357209Z x_sign = torch.sign(x) 2025-05-07T20:32:32.0357499Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.0357812Z x = x_sign * x_clamp 2025-05-07T20:32:32.0358057Z x0 = x[:, :D] 2025-05-07T20:32:32.0358274Z x1 = x[:, D:] 2025-05-07T20:32:32.0358481Z 2025-05-07T20:32:32.0358668Z if contiguous: 2025-05-07T20:32:32.0358900Z x0 = x0.contiguous() 2025-05-07T20:32:32.0359164Z x1 = x1.contiguous() 2025-05-07T20:32:32.0359406Z 2025-05-07T20:32:32.0359599Z if scale_ub is not None: 2025-05-07T20:32:32.0359897Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.0360266Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.0360578Z ) 2025-05-07T20:32:32.0360763Z else: 2025-05-07T20:32:32.0360978Z scale_ub_tensor = None 2025-05-07T20:32:32.0361230Z 2025-05-07T20:32:32.0361461Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.0361832Z op = silu_mul_quant 2025-05-07T20:32:32.0362181Z if compiled: 2025-05-07T20:32:32.0362431Z op = torch.compile(op) 2025-05-07T20:32:32.0362730Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0363004Z 2025-05-07T20:32:32.0363193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.0363366Z 2025-05-07T20:32:32.0363470Z moe/activation_test.py:117: 2025-05-07T20:32:32.0363762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0364093Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.0364375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0364934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.0365494Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.0366154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.0366842Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.0367376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.0368061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.0368721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.0369252Z kernel = self.compile( 2025-05-07T20:32:32.0369797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.0370460Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.0370851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0371084Z 2025-05-07T20:32:32.0371297Z self = 2025-05-07T20:32:32.0372436Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.0373919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f0280>} 2025-05-07T20:32:32.0375344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.0376422Z context = 2025-05-07T20:32:32.0376708Z 2025-05-07T20:32:32.0376880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.0377403Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.0377868Z module_map=module_map) 2025-05-07T20:32:32.0378232Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.0378585Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.0378842Z E ^ 2025-05-07T20:32:32.0379301Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.0379750Z 2025-05-07T20:32:32.0380280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.0380797Z 2025-05-07T20:32:32.3665002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.3665975Z self=, 2025-05-07T20:32:32.3666774Z T=4096, 2025-05-07T20:32:32.3667142Z D=5120, 2025-05-07T20:32:32.3667513Z scale_ub=1200.0, 2025-05-07T20:32:32.3667959Z contiguous=False, 2025-05-07T20:32:32.3668606Z compiled=False, 2025-05-07T20:32:32.3669003Z ) 2025-05-07T20:32:32.3669628Z self = 2025-05-07T20:32:32.3670252Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.3670528Z 2025-05-07T20:32:32.3670614Z @given( 2025-05-07T20:32:32.3670843Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.3671161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.3671470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.3671794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.3672132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.3672415Z ) 2025-05-07T20:32:32.3672764Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.3673210Z def test_silu_mul_quant( 2025-05-07T20:32:32.3673461Z self, 2025-05-07T20:32:32.3673658Z T: int, 2025-05-07T20:32:32.3673853Z D: int, 2025-05-07T20:32:32.3674073Z scale_ub: Optional[float], 2025-05-07T20:32:32.3674347Z contiguous: bool, 2025-05-07T20:32:32.3674587Z compiled: bool, 2025-05-07T20:32:32.3674809Z ) -> None: 2025-05-07T20:32:32.3675029Z torch.manual_seed(2025) 2025-05-07T20:32:32.3675265Z 2025-05-07T20:32:32.3675543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.3675897Z 2025-05-07T20:32:32.3676093Z x_sign = torch.sign(x) 2025-05-07T20:32:32.3676389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.3676702Z x = x_sign * x_clamp 2025-05-07T20:32:32.3676941Z x0 = x[:, :D] 2025-05-07T20:32:32.3677154Z x1 = x[:, D:] 2025-05-07T20:32:32.3677357Z 2025-05-07T20:32:32.3677542Z if contiguous: 2025-05-07T20:32:32.3677776Z x0 = x0.contiguous() 2025-05-07T20:32:32.3678116Z x1 = x1.contiguous() 2025-05-07T20:32:32.3678352Z 2025-05-07T20:32:32.3678542Z if scale_ub is not None: 2025-05-07T20:32:32.3678817Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.3679154Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.3679455Z ) 2025-05-07T20:32:32.3679647Z else: 2025-05-07T20:32:32.3679861Z scale_ub_tensor = None 2025-05-07T20:32:32.3680101Z 2025-05-07T20:32:32.3680327Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.3680713Z op = silu_mul_quant 2025-05-07T20:32:32.3680957Z if compiled: 2025-05-07T20:32:32.3681264Z op = torch.compile(op) 2025-05-07T20:32:32.3681564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.3681830Z 2025-05-07T20:32:32.3682020Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.3682182Z 2025-05-07T20:32:32.3682287Z moe/activation_test.py:117: 2025-05-07T20:32:32.3682580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.3682907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.3683187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.3683872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:32.3684550Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.3685088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.3685768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.3686424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.3686950Z kernel = self.compile( 2025-05-07T20:32:32.3687482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.3688191Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.3688579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.3688807Z 2025-05-07T20:32:32.3689016Z self = 2025-05-07T20:32:32.3690434Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.3691808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f1000>} 2025-05-07T20:32:32.3693151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.3694155Z context = 2025-05-07T20:32:32.3694440Z 2025-05-07T20:32:32.3694602Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.3695113Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.3695572Z module_map=module_map) 2025-05-07T20:32:32.3695932Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.3696283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.3696537Z E ^ 2025-05-07T20:32:32.3696991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.3697439Z 2025-05-07T20:32:32.3697849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.3698451Z 2025-05-07T20:32:32.3698558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.3698964Z self=, 2025-05-07T20:32:32.3699353Z T=4096, 2025-05-07T20:32:32.3699533Z D=5120, 2025-05-07T20:32:32.3699718Z scale_ub=1200.0, 2025-05-07T20:32:32.3700010Z contiguous=False, 2025-05-07T20:32:32.3700254Z compiled=True, 2025-05-07T20:32:32.3700475Z ) 2025-05-07T20:32:32.3700780Z self = 2025-05-07T20:32:32.3701350Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.3701617Z 2025-05-07T20:32:32.3701758Z @given( 2025-05-07T20:32:32.3701985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.3702297Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.3702603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.3702958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.3703291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.3703566Z ) 2025-05-07T20:32:32.3703918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.3704350Z def test_silu_mul_quant( 2025-05-07T20:32:32.3704580Z self, 2025-05-07T20:32:32.3704771Z T: int, 2025-05-07T20:32:32.3704965Z D: int, 2025-05-07T20:32:32.3705176Z scale_ub: Optional[float], 2025-05-07T20:32:32.3705445Z contiguous: bool, 2025-05-07T20:32:32.3705687Z compiled: bool, 2025-05-07T20:32:32.3705901Z ) -> None: 2025-05-07T20:32:32.3706118Z torch.manual_seed(2025) 2025-05-07T20:32:32.3706362Z 2025-05-07T20:32:32.3706627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.3706964Z 2025-05-07T20:32:32.3707158Z x_sign = torch.sign(x) 2025-05-07T20:32:32.3707440Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.3707815Z x = x_sign * x_clamp 2025-05-07T20:32:32.3708058Z x0 = x[:, :D] 2025-05-07T20:32:32.3708273Z x1 = x[:, D:] 2025-05-07T20:32:32.3708469Z 2025-05-07T20:32:32.3708650Z if contiguous: 2025-05-07T20:32:32.3708880Z x0 = x0.contiguous() 2025-05-07T20:32:32.3709125Z x1 = x1.contiguous() 2025-05-07T20:32:32.3709357Z 2025-05-07T20:32:32.3709543Z if scale_ub is not None: 2025-05-07T20:32:32.3709810Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.3710148Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.3710455Z ) 2025-05-07T20:32:32.3710645Z else: 2025-05-07T20:32:32.3710853Z scale_ub_tensor = None 2025-05-07T20:32:32.3711095Z 2025-05-07T20:32:32.3711315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.3711622Z op = silu_mul_quant 2025-05-07T20:32:32.3711873Z if compiled: 2025-05-07T20:32:32.3712116Z op = torch.compile(op) 2025-05-07T20:32:32.3712404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.3712670Z 2025-05-07T20:32:32.3712860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.3713023Z 2025-05-07T20:32:32.3713121Z moe/activation_test.py:117: 2025-05-07T20:32:32.3713418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.3713744Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.3714014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.3714572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.3715127Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.3715775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.3716463Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.3717047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.3717717Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.3718370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.3718894Z kernel = self.compile( 2025-05-07T20:32:32.3719425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.3720120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.3720552Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.3720783Z 2025-05-07T20:32:32.3720989Z self = 2025-05-07T20:32:32.3722055Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.3723412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f0700>} 2025-05-07T20:32:32.3724731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.3725750Z context = 2025-05-07T20:32:32.3726042Z 2025-05-07T20:32:32.3726206Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.3726719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.3727187Z module_map=module_map) 2025-05-07T20:32:32.3727591Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.3727948Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.3728201Z E ^ 2025-05-07T20:32:32.3728660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.3729104Z 2025-05-07T20:32:32.3729526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.3730046Z 2025-05-07T20:32:32.5023997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5024528Z self=, 2025-05-07T20:32:32.5024953Z T=2048, 2025-05-07T20:32:32.5025146Z D=7168, 2025-05-07T20:32:32.5025341Z scale_ub=1200.0, 2025-05-07T20:32:32.5025573Z contiguous=False, 2025-05-07T20:32:32.5025797Z compiled=False, 2025-05-07T20:32:32.5026005Z ) 2025-05-07T20:32:32.5026327Z self = 2025-05-07T20:32:32.5026825Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.5027104Z 2025-05-07T20:32:32.5027183Z @given( 2025-05-07T20:32:32.5027417Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.5027721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.5028033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.5028365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.5028700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.5028978Z ) 2025-05-07T20:32:32.5029334Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.5029775Z def test_silu_mul_quant( 2025-05-07T20:32:32.5030014Z self, 2025-05-07T20:32:32.5030211Z T: int, 2025-05-07T20:32:32.5030415Z D: int, 2025-05-07T20:32:32.5030752Z scale_ub: Optional[float], 2025-05-07T20:32:32.5031032Z contiguous: bool, 2025-05-07T20:32:32.5031272Z compiled: bool, 2025-05-07T20:32:32.5031499Z ) -> None: 2025-05-07T20:32:32.5031725Z torch.manual_seed(2025) 2025-05-07T20:32:32.5031975Z 2025-05-07T20:32:32.5032247Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.5032592Z 2025-05-07T20:32:32.5032791Z x_sign = torch.sign(x) 2025-05-07T20:32:32.5033079Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.5033480Z x = x_sign * x_clamp 2025-05-07T20:32:32.5033724Z x0 = x[:, :D] 2025-05-07T20:32:32.5033944Z x1 = x[:, D:] 2025-05-07T20:32:32.5034207Z 2025-05-07T20:32:32.5034400Z if contiguous: 2025-05-07T20:32:32.5034634Z x0 = x0.contiguous() 2025-05-07T20:32:32.5034888Z x1 = x1.contiguous() 2025-05-07T20:32:32.5035124Z 2025-05-07T20:32:32.5035324Z if scale_ub is not None: 2025-05-07T20:32:32.5035597Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.5035940Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.5036258Z ) 2025-05-07T20:32:32.5036447Z else: 2025-05-07T20:32:32.5036663Z scale_ub_tensor = None 2025-05-07T20:32:32.5036910Z 2025-05-07T20:32:32.5037171Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.5037490Z op = silu_mul_quant 2025-05-07T20:32:32.5037734Z if compiled: 2025-05-07T20:32:32.5037989Z op = torch.compile(op) 2025-05-07T20:32:32.5038284Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5038554Z 2025-05-07T20:32:32.5038756Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.5038920Z 2025-05-07T20:32:32.5039024Z moe/activation_test.py:117: 2025-05-07T20:32:32.5039311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5039642Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.5039992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5040688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:32.5041366Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.5041906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.5042582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.5043242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.5043770Z kernel = self.compile( 2025-05-07T20:32:32.5044316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.5044959Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.5045358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5045585Z 2025-05-07T20:32:32.5045790Z self = 2025-05-07T20:32:32.5046856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.5048227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f1240>} 2025-05-07T20:32:32.5049557Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.5050576Z context = 2025-05-07T20:32:32.5050913Z 2025-05-07T20:32:32.5051080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.5051604Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.5052062Z module_map=module_map) 2025-05-07T20:32:32.5052429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.5052779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.5053029Z E ^ 2025-05-07T20:32:32.5053534Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.5054017Z 2025-05-07T20:32:32.5054435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.5054947Z 2025-05-07T20:32:32.5055057Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5055467Z self=, 2025-05-07T20:32:32.5055867Z T=1, 2025-05-07T20:32:32.5056048Z D=7168, 2025-05-07T20:32:32.5056238Z scale_ub=None, 2025-05-07T20:32:32.5056454Z contiguous=True, 2025-05-07T20:32:32.5056676Z compiled=False, 2025-05-07T20:32:32.5056876Z ) 2025-05-07T20:32:32.5057192Z self = 2025-05-07T20:32:32.5057676Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.5057930Z 2025-05-07T20:32:32.5058018Z @given( 2025-05-07T20:32:32.5058244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.5058559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.5058868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.5059189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.5059514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.5059871Z ) 2025-05-07T20:32:32.5060265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.5060703Z def test_silu_mul_quant( 2025-05-07T20:32:32.5060945Z self, 2025-05-07T20:32:32.5061132Z T: int, 2025-05-07T20:32:32.5061330Z D: int, 2025-05-07T20:32:32.5061552Z scale_ub: Optional[float], 2025-05-07T20:32:32.5061822Z contiguous: bool, 2025-05-07T20:32:32.5062054Z compiled: bool, 2025-05-07T20:32:32.5062278Z ) -> None: 2025-05-07T20:32:32.5062497Z torch.manual_seed(2025) 2025-05-07T20:32:32.5062733Z 2025-05-07T20:32:32.5063005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.5063345Z 2025-05-07T20:32:32.5063532Z x_sign = torch.sign(x) 2025-05-07T20:32:32.5063820Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.5064124Z x = x_sign * x_clamp 2025-05-07T20:32:32.5064357Z x0 = x[:, :D] 2025-05-07T20:32:32.5064585Z x1 = x[:, D:] 2025-05-07T20:32:32.5064797Z 2025-05-07T20:32:32.5064981Z if contiguous: 2025-05-07T20:32:32.5065216Z x0 = x0.contiguous() 2025-05-07T20:32:32.5065476Z x1 = x1.contiguous() 2025-05-07T20:32:32.5065707Z 2025-05-07T20:32:32.5065901Z if scale_ub is not None: 2025-05-07T20:32:32.5066173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.5066502Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.5066803Z ) 2025-05-07T20:32:32.5066996Z else: 2025-05-07T20:32:32.5067209Z scale_ub_tensor = None 2025-05-07T20:32:32.5067452Z 2025-05-07T20:32:32.5067684Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.5067995Z op = silu_mul_quant 2025-05-07T20:32:32.5068238Z if compiled: 2025-05-07T20:32:32.5068489Z op = torch.compile(op) 2025-05-07T20:32:32.5068784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5069105Z 2025-05-07T20:32:32.5069303Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.5069466Z 2025-05-07T20:32:32.5069573Z moe/activation_test.py:117: 2025-05-07T20:32:32.5069868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5070196Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.5070479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5071160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.5071895Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.5072530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.5073220Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.5073872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.5074403Z kernel = self.compile( 2025-05-07T20:32:32.5074941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.5075589Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.5075977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5076208Z 2025-05-07T20:32:32.5076414Z self = 2025-05-07T20:32:32.5077490Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.5078905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f2050>} 2025-05-07T20:32:32.5080300Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.5081331Z context = 2025-05-07T20:32:32.5081619Z 2025-05-07T20:32:32.5081782Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.5082295Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.5082756Z module_map=module_map) 2025-05-07T20:32:32.5083122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.5083473Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.5083729Z E ^ 2025-05-07T20:32:32.5084181Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.5084634Z 2025-05-07T20:32:32.5085045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.5085551Z 2025-05-07T20:32:32.5085662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5086066Z self=, 2025-05-07T20:32:32.5086459Z T=16384, 2025-05-07T20:32:32.5086653Z D=7168, 2025-05-07T20:32:32.5086848Z scale_ub=1200.0, 2025-05-07T20:32:32.5087071Z contiguous=False, 2025-05-07T20:32:32.5087295Z compiled=True, 2025-05-07T20:32:32.7715879Z ) 2025-05-07T20:32:32.7716247Z self = 2025-05-07T20:32:32.7716823Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.7717119Z 2025-05-07T20:32:32.7717199Z @given( 2025-05-07T20:32:32.7717440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7717868Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7718181Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7718519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7718852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7719134Z ) 2025-05-07T20:32:32.7719491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7719928Z def test_silu_mul_quant( 2025-05-07T20:32:32.7720166Z self, 2025-05-07T20:32:32.7720436Z T: int, 2025-05-07T20:32:32.7720635Z D: int, 2025-05-07T20:32:32.7720852Z scale_ub: Optional[float], 2025-05-07T20:32:32.7721210Z contiguous: bool, 2025-05-07T20:32:32.7721458Z compiled: bool, 2025-05-07T20:32:32.7721684Z ) -> None: 2025-05-07T20:32:32.7721905Z torch.manual_seed(2025) 2025-05-07T20:32:32.7722146Z 2025-05-07T20:32:32.7722418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7722761Z 2025-05-07T20:32:32.7722956Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7723258Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7723564Z x = x_sign * x_clamp 2025-05-07T20:32:32.7723810Z x0 = x[:, :D] 2025-05-07T20:32:32.7724032Z x1 = x[:, D:] 2025-05-07T20:32:32.7724234Z 2025-05-07T20:32:32.7724424Z if contiguous: 2025-05-07T20:32:32.7724661Z x0 = x0.contiguous() 2025-05-07T20:32:32.7724922Z x1 = x1.contiguous() 2025-05-07T20:32:32.7725166Z 2025-05-07T20:32:32.7725366Z if scale_ub is not None: 2025-05-07T20:32:32.7725641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7725975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7726284Z ) 2025-05-07T20:32:32.7726468Z else: 2025-05-07T20:32:32.7726685Z scale_ub_tensor = None 2025-05-07T20:32:32.7726941Z 2025-05-07T20:32:32.7727231Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7727550Z op = silu_mul_quant 2025-05-07T20:32:32.7727806Z if compiled: 2025-05-07T20:32:32.7728058Z op = torch.compile(op) 2025-05-07T20:32:32.7728358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7728634Z 2025-05-07T20:32:32.7728828Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7728995Z 2025-05-07T20:32:32.7729097Z moe/activation_test.py:117: 2025-05-07T20:32:32.7729400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7729732Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7730017Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7730578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.7731138Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.7731799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7732476Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7733013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7733694Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7734353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7734881Z kernel = self.compile( 2025-05-07T20:32:32.7735428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7736074Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7736463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7736749Z 2025-05-07T20:32:32.7736965Z self = 2025-05-07T20:32:32.7738052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7739430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f3490>} 2025-05-07T20:32:32.7740980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7742011Z context = 2025-05-07T20:32:32.7742299Z 2025-05-07T20:32:32.7742463Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7742979Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7743435Z module_map=module_map) 2025-05-07T20:32:32.7743802Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7744160Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7744420Z E ^ 2025-05-07T20:32:32.7744875Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7745333Z 2025-05-07T20:32:32.7745747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7746252Z 2025-05-07T20:32:32.7746364Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7746772Z self=, 2025-05-07T20:32:32.7747161Z T=1, 2025-05-07T20:32:32.7747345Z D=7168, 2025-05-07T20:32:32.7747582Z scale_ub=None, 2025-05-07T20:32:32.7747797Z contiguous=False, 2025-05-07T20:32:32.7748022Z compiled=False, 2025-05-07T20:32:32.7748220Z ) 2025-05-07T20:32:32.7748529Z self = 2025-05-07T20:32:32.7749009Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.7749265Z 2025-05-07T20:32:32.7749345Z @given( 2025-05-07T20:32:32.7749568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7749879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7750208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7750555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7750877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7751159Z ) 2025-05-07T20:32:32.7751507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7751947Z def test_silu_mul_quant( 2025-05-07T20:32:32.7752186Z self, 2025-05-07T20:32:32.7752380Z T: int, 2025-05-07T20:32:32.7752572Z D: int, 2025-05-07T20:32:32.7752790Z scale_ub: Optional[float], 2025-05-07T20:32:32.7753063Z contiguous: bool, 2025-05-07T20:32:32.7753297Z compiled: bool, 2025-05-07T20:32:32.7753519Z ) -> None: 2025-05-07T20:32:32.7753733Z torch.manual_seed(2025) 2025-05-07T20:32:32.7753965Z 2025-05-07T20:32:32.7754232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7754565Z 2025-05-07T20:32:32.7754749Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7755040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7755343Z x = x_sign * x_clamp 2025-05-07T20:32:32.7755583Z x0 = x[:, :D] 2025-05-07T20:32:32.7755810Z x1 = x[:, D:] 2025-05-07T20:32:32.7756014Z 2025-05-07T20:32:32.7756198Z if contiguous: 2025-05-07T20:32:32.7756481Z x0 = x0.contiguous() 2025-05-07T20:32:32.7756733Z x1 = x1.contiguous() 2025-05-07T20:32:32.7756970Z 2025-05-07T20:32:32.7757154Z if scale_ub is not None: 2025-05-07T20:32:32.7757417Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7757744Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7758041Z ) 2025-05-07T20:32:32.7758227Z else: 2025-05-07T20:32:32.7758435Z scale_ub_tensor = None 2025-05-07T20:32:32.7758688Z 2025-05-07T20:32:32.7758962Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7759272Z op = silu_mul_quant 2025-05-07T20:32:32.7759561Z if compiled: 2025-05-07T20:32:32.7759807Z op = torch.compile(op) 2025-05-07T20:32:32.7760122Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7760419Z 2025-05-07T20:32:32.7760607Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7760780Z 2025-05-07T20:32:32.7760879Z moe/activation_test.py:117: 2025-05-07T20:32:32.7761177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7761502Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7761779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7762462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7763144Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7763678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7764355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7765016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7765533Z kernel = self.compile( 2025-05-07T20:32:32.7766119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7766777Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7767164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7767387Z 2025-05-07T20:32:32.7767594Z self = 2025-05-07T20:32:32.7768658Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7770017Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f37f0>} 2025-05-07T20:32:32.7771349Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7772359Z context = 2025-05-07T20:32:32.7772649Z 2025-05-07T20:32:32.7772812Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7773326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7773785Z module_map=module_map) 2025-05-07T20:32:32.7774147Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7774492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7774747Z E ^ 2025-05-07T20:32:32.7775206Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7775645Z 2025-05-07T20:32:32.7776054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7776611Z 2025-05-07T20:32:32.7776714Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7777120Z self=, 2025-05-07T20:32:32.7777511Z T=2048, 2025-05-07T20:32:32.7777688Z D=7168, 2025-05-07T20:32:32.7777876Z scale_ub=None, 2025-05-07T20:32:32.7778087Z contiguous=False, 2025-05-07T20:32:32.7778304Z compiled=True, 2025-05-07T20:32:32.7778500Z ) 2025-05-07T20:32:32.8779002Z self = 2025-05-07T20:32:32.8779674Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8780035Z 2025-05-07T20:32:32.8780114Z @given( 2025-05-07T20:32:32.8780352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8780664Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8780960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8781301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8781634Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8781921Z ) 2025-05-07T20:32:32.8782269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8782711Z def test_silu_mul_quant( 2025-05-07T20:32:32.8782953Z self, 2025-05-07T20:32:32.8783151Z T: int, 2025-05-07T20:32:32.8783354Z D: int, 2025-05-07T20:32:32.8783584Z scale_ub: Optional[float], 2025-05-07T20:32:32.8783858Z contiguous: bool, 2025-05-07T20:32:32.8784099Z compiled: bool, 2025-05-07T20:32:32.8784332Z ) -> None: 2025-05-07T20:32:32.8784544Z torch.manual_seed(2025) 2025-05-07T20:32:32.8784789Z 2025-05-07T20:32:32.8785065Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8785400Z 2025-05-07T20:32:32.8785602Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8785960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8786288Z x = x_sign * x_clamp 2025-05-07T20:32:32.8786522Z x0 = x[:, :D] 2025-05-07T20:32:32.8786748Z x1 = x[:, D:] 2025-05-07T20:32:32.8786956Z 2025-05-07T20:32:32.8787132Z if contiguous: 2025-05-07T20:32:32.8787369Z x0 = x0.contiguous() 2025-05-07T20:32:32.8787634Z x1 = x1.contiguous() 2025-05-07T20:32:32.8787874Z 2025-05-07T20:32:32.8794499Z if scale_ub is not None: 2025-05-07T20:32:32.8794806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8795144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8795463Z ) 2025-05-07T20:32:32.8795663Z else: 2025-05-07T20:32:32.8795869Z scale_ub_tensor = None 2025-05-07T20:32:32.8796119Z 2025-05-07T20:32:32.8796357Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8796671Z op = silu_mul_quant 2025-05-07T20:32:32.8796926Z if compiled: 2025-05-07T20:32:32.8797177Z op = torch.compile(op) 2025-05-07T20:32:32.8797468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8797744Z 2025-05-07T20:32:32.8797942Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8798106Z 2025-05-07T20:32:32.8798215Z moe/activation_test.py:117: 2025-05-07T20:32:32.8798512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8798843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8799125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8799677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8800240Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8800896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8801691Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8802221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8802895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8803554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8804076Z kernel = self.compile( 2025-05-07T20:32:32.8804612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8805393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8805790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8806014Z 2025-05-07T20:32:32.8806220Z self = 2025-05-07T20:32:32.8807291Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8808655Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1b50af0>} 2025-05-07T20:32:32.8809988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8811068Z context = 2025-05-07T20:32:32.8811360Z 2025-05-07T20:32:32.8811525Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8812039Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8812563Z module_map=module_map) 2025-05-07T20:32:32.8812922Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8813275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8813532Z E ^ 2025-05-07T20:32:32.8813984Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8814428Z 2025-05-07T20:32:32.8814840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8815354Z 2025-05-07T20:32:32.8815456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8815866Z self=, 2025-05-07T20:32:32.8816273Z T=4096, 2025-05-07T20:32:32.8816461Z D=7168, 2025-05-07T20:32:32.8816653Z scale_ub=None, 2025-05-07T20:32:32.8816861Z contiguous=False, 2025-05-07T20:32:32.8817091Z compiled=True, 2025-05-07T20:32:32.8817292Z ) 2025-05-07T20:32:32.8817601Z self = 2025-05-07T20:32:32.8818088Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8818357Z 2025-05-07T20:32:32.8818433Z @given( 2025-05-07T20:32:32.8818663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8818969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8819279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8819607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8820092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8820487Z ) 2025-05-07T20:32:32.8820906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8821387Z def test_silu_mul_quant( 2025-05-07T20:32:32.8821634Z self, 2025-05-07T20:32:32.8821897Z T: int, 2025-05-07T20:32:32.8822105Z D: int, 2025-05-07T20:32:32.8822322Z scale_ub: Optional[float], 2025-05-07T20:32:32.8822594Z contiguous: bool, 2025-05-07T20:32:32.8822833Z compiled: bool, 2025-05-07T20:32:32.8823057Z ) -> None: 2025-05-07T20:32:32.8823265Z torch.manual_seed(2025) 2025-05-07T20:32:32.8823507Z 2025-05-07T20:32:32.8823779Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8824115Z 2025-05-07T20:32:32.8824311Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8824652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8824956Z x = x_sign * x_clamp 2025-05-07T20:32:32.8825240Z x0 = x[:, :D] 2025-05-07T20:32:32.8825463Z x1 = x[:, D:] 2025-05-07T20:32:32.8825664Z 2025-05-07T20:32:32.8825850Z if contiguous: 2025-05-07T20:32:32.8826082Z x0 = x0.contiguous() 2025-05-07T20:32:32.8826336Z x1 = x1.contiguous() 2025-05-07T20:32:32.8826575Z 2025-05-07T20:32:32.8826772Z if scale_ub is not None: 2025-05-07T20:32:32.8827039Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8827372Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8827677Z ) 2025-05-07T20:32:32.8827866Z else: 2025-05-07T20:32:32.8828079Z scale_ub_tensor = None 2025-05-07T20:32:32.8828329Z 2025-05-07T20:32:32.8828562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8828868Z op = silu_mul_quant 2025-05-07T20:32:32.8829123Z if compiled: 2025-05-07T20:32:32.8829377Z op = torch.compile(op) 2025-05-07T20:32:32.8829670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8829938Z 2025-05-07T20:32:32.8830134Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8830296Z 2025-05-07T20:32:32.8830397Z moe/activation_test.py:117: 2025-05-07T20:32:32.8830740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8831076Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8831352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8831903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8832457Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8833113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8833794Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8834331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8835003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8835662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8836196Z kernel = self.compile( 2025-05-07T20:32:32.8836739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8837387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8837778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8838007Z 2025-05-07T20:32:32.8838214Z self = 2025-05-07T20:32:32.8839281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8840642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1b50280>} 2025-05-07T20:32:32.8842020Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8843048Z context = 2025-05-07T20:32:32.8843340Z 2025-05-07T20:32:32.8843505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8844028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8844533Z module_map=module_map) 2025-05-07T20:32:32.8844895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8845285Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8845551Z E ^ 2025-05-07T20:32:32.8846010Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8846459Z 2025-05-07T20:32:32.8846884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8847403Z 2025-05-07T20:32:33.2324099Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.2324563Z self=, 2025-05-07T20:32:33.2324999Z T=16384, 2025-05-07T20:32:33.2325202Z D=5120, 2025-05-07T20:32:33.2325398Z scale_ub=1200.0, 2025-05-07T20:32:33.2325627Z contiguous=False, 2025-05-07T20:32:33.2325854Z compiled=False, 2025-05-07T20:32:33.2326075Z ) 2025-05-07T20:32:33.2326397Z self = 2025-05-07T20:32:33.2326901Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.2327218Z 2025-05-07T20:32:33.2327299Z @given( 2025-05-07T20:32:33.2327527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.2327844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.2328270Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.2328600Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.2328932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.2329216Z ) 2025-05-07T20:32:33.2329564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.2330011Z def test_silu_mul_quant( 2025-05-07T20:32:33.2330261Z self, 2025-05-07T20:32:33.2330450Z T: int, 2025-05-07T20:32:33.2330654Z D: int, 2025-05-07T20:32:33.2330880Z scale_ub: Optional[float], 2025-05-07T20:32:33.2331156Z contiguous: bool, 2025-05-07T20:32:33.2331403Z compiled: bool, 2025-05-07T20:32:33.2331639Z ) -> None: 2025-05-07T20:32:33.2331853Z torch.manual_seed(2025) 2025-05-07T20:32:33.2332089Z 2025-05-07T20:32:33.2332364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.2332706Z 2025-05-07T20:32:33.2332900Z x_sign = torch.sign(x) 2025-05-07T20:32:33.2333192Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.2333499Z x = x_sign * x_clamp 2025-05-07T20:32:33.2333737Z x0 = x[:, :D] 2025-05-07T20:32:33.2333955Z x1 = x[:, D:] 2025-05-07T20:32:33.2334162Z 2025-05-07T20:32:33.2334340Z if contiguous: 2025-05-07T20:32:33.2334576Z x0 = x0.contiguous() 2025-05-07T20:32:33.2334836Z x1 = x1.contiguous() 2025-05-07T20:32:33.2335076Z 2025-05-07T20:32:33.2335266Z if scale_ub is not None: 2025-05-07T20:32:33.2335542Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.2335874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.2336180Z ) 2025-05-07T20:32:33.2336379Z else: 2025-05-07T20:32:33.2336592Z scale_ub_tensor = None 2025-05-07T20:32:33.2336839Z 2025-05-07T20:32:33.2337150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.2337463Z op = silu_mul_quant 2025-05-07T20:32:33.2337719Z if compiled: 2025-05-07T20:32:33.2337970Z op = torch.compile(op) 2025-05-07T20:32:33.2338262Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2338532Z 2025-05-07T20:32:33.2338730Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.2338897Z 2025-05-07T20:32:33.2339005Z moe/activation_test.py:117: 2025-05-07T20:32:33.2339297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2339700Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.2340046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.2340816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:33.2341497Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.2342037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.2342716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.2343367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.2343892Z kernel = self.compile( 2025-05-07T20:32:33.2344429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.2345077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.2345477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2345703Z 2025-05-07T20:32:33.2345906Z self = 2025-05-07T20:32:33.2347016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.2348382Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1b52d40>} 2025-05-07T20:32:33.2349704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.2350723Z context = 2025-05-07T20:32:33.2351006Z 2025-05-07T20:32:33.2351179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.2351695Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.2352154Z module_map=module_map) 2025-05-07T20:32:33.2352521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.2352869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.2353116Z E ^ 2025-05-07T20:32:33.2353574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

The next eleven Hypothesis examples all failed at kernel compilation with the identical error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
E   The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Only the sampled parameters differ; they are listed below, after a note on the root cause.
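All of these failures are one compile-time rejection, raised while Triton lowers _fbgemm_silu_mul_quant from Python AST to TTIR, before anything executes on the device: the kernel materializes an fp8e4nv (float8 e4m3) value, and Triton's CUDA backend accepts fp8e4nv only on GPUs with compute capability 8.9 or newer. The tuple in the message, ('fp8e4b15', 'fp8e5'), is what Triton advertises on older architectures, so the GPU behind this job is pre-sm_89. A test in this position can skip instead of erroring out; a minimal sketch of such a guard, assuming skipping is the desired behavior (the helper and class names are illustrative, not taken from the test file):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton compiles fp8e4nv (e4m3) kernels only for compute capability
        # (8, 9) and above, i.e. Ada (sm_89) and newer; older parts raise the
        # CompilationError captured in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch: guard the whole test class so Hypothesis never drives the kernel.
    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantTests(unittest.TestCase):
        ...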
Parameters of the next ten failing examples (each one raised the CompilationError above):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
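For reference, the check fires inside make_ir (the AST-to-TTIR step visible in every traceback above), so a trivial kernel that merely casts to the dtype reproduces the failure with no FBGEMM code involved. A minimal sketch, assuming a recent Triton on a pre-sm_89 CUDA GPU; the kernel is illustrative and is expected to fail at compile time with the same ValueError:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On pre-sm_89 GPUs the cast below is rejected during lowering with
        # ValueError("type fp8e4nv not supported in this architecture. ...").
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    cast_to_fp8e4nv[(triton.cdiv(x.numel(), 256),)](x, y, x.numel(), BLOCK=256)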
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.3726939Z 2025-05-07T20:32:34.3727354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.3727867Z 2025-05-07T20:32:34.3727974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.3728428Z self=, 2025-05-07T20:32:34.3728824Z T=128, 2025-05-07T20:32:34.3729049Z D=7168, 2025-05-07T20:32:34.3729239Z scale_ub=1200.0, 2025-05-07T20:32:34.3729469Z contiguous=False, 2025-05-07T20:32:34.3729695Z compiled=True, 2025-05-07T20:32:34.3729898Z ) 2025-05-07T20:32:34.4767869Z self = 2025-05-07T20:32:34.4768455Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:34.4768725Z 2025-05-07T20:32:34.4768814Z @given( 2025-05-07T20:32:34.4769045Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4769363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4769674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4770010Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4770339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4770638Z ) 2025-05-07T20:32:34.4770998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4771437Z def test_silu_mul_quant( 2025-05-07T20:32:34.4771684Z self, 2025-05-07T20:32:34.4771888Z T: int, 2025-05-07T20:32:34.4772084Z D: int, 2025-05-07T20:32:34.4772311Z scale_ub: Optional[float], 2025-05-07T20:32:34.4772590Z contiguous: bool, 2025-05-07T20:32:34.4772953Z compiled: bool, 2025-05-07T20:32:34.4773186Z ) -> None: 2025-05-07T20:32:34.4773407Z torch.manual_seed(2025) 2025-05-07T20:32:34.4773652Z 2025-05-07T20:32:34.4773942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4774286Z 2025-05-07T20:32:34.4774483Z x_sign = torch.sign(x) 2025-05-07T20:32:34.4774771Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.4775081Z x = x_sign * x_clamp 2025-05-07T20:32:34.4775321Z x0 = x[:, :D] 2025-05-07T20:32:34.4775533Z x1 = x[:, D:] 2025-05-07T20:32:34.4775743Z 2025-05-07T20:32:34.4775935Z if contiguous: 2025-05-07T20:32:34.4776169Z x0 = x0.contiguous() 2025-05-07T20:32:34.4776433Z x1 = x1.contiguous() 2025-05-07T20:32:34.4776675Z 2025-05-07T20:32:34.4776862Z if scale_ub is not None: 2025-05-07T20:32:34.4777138Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.4777491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.4777793Z ) 2025-05-07T20:32:34.4777982Z else: 2025-05-07T20:32:34.4778194Z scale_ub_tensor = None 2025-05-07T20:32:34.4778433Z 2025-05-07T20:32:34.4778671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.4778983Z op = silu_mul_quant 2025-05-07T20:32:34.4779232Z if compiled: 2025-05-07T20:32:34.4779487Z op = torch.compile(op) 2025-05-07T20:32:34.4779852Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.4780134Z 2025-05-07T20:32:34.4780324Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.4780496Z 2025-05-07T20:32:34.4780600Z moe/activation_test.py:117: 2025-05-07T20:32:34.4780899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.4781229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.4781512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.4782150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.4782705Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True,
) -> fails with the same CompilationError (fp8e4nv not supported), via an identical traceback through silu_mul_quant and the Triton compiler.
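Every CompilationError in this run is the same architecture mismatch: Triton's fp8e4nv is the float8_e4m3fn format, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper), while the A10G in this g5.4xlarge runner reports compute capability 8.6 and hence offers only fp8e4b15 and fp8e5. A minimal sketch of a guard that would skip these examples on unsupported hardware follows; the helper name and decorator placement are illustrative, not taken from the test suite.

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with compute
    # capability >= 8.9; the A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Illustrative usage on the failing test:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(...): ...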
Trying example: test_silu_mul_quant(
    T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False,
)
[test body as listed above]
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
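The OutOfMemoryError examples share a pattern: GPU 0 always reports 22.07 GiB total, but the free amount shrinks from about 140 MiB here to about 26 MiB by the end of the run, until even 40 MiB allocations fail. That suggests allocations accumulating across Hypothesis examples rather than any single example being too large. Beyond the allocator's own suggestion of PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, one plausible mitigation is a per-example cleanup hook; the sketch below assumes the test does not intentionally cache tensors between examples. The remaining tried examples are summarized after the sketch.

import gc
import torch

def release_cuda_memory() -> None:
    # Drop dead Python references, then return the allocator's cached
    # blocks to the driver so the next Hypothesis example starts clean.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

# Illustrative usage: call from setUp/tearDown (or a per-example
# Hypothesis hook) in moe/activation_test.py.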
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 448.00 MiB)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=None,   contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 320.00 MiB)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 80.00 MiB)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 112.00 MiB)

Trying example: test_silu_mul_quant(
    T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True,
)
[test body as listed above]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3010194Z 2025-05-07T20:32:35.3010318Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3010531Z 2025-05-07T20:32:35.3010638Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3011054Z self=, 2025-05-07T20:32:35.3011479Z T=4096, 2025-05-07T20:32:35.3011665Z D=7168, 2025-05-07T20:32:35.3011855Z scale_ub=None, 2025-05-07T20:32:35.3012064Z contiguous=True, 2025-05-07T20:32:35.3012282Z compiled=False, 2025-05-07T20:32:35.3012489Z ) 2025-05-07T20:32:35.3012815Z self = 2025-05-07T20:32:35.3013307Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3013570Z 2025-05-07T20:32:35.3013644Z @given( 2025-05-07T20:32:35.3013868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3014184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3014543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3015008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3015346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3015628Z ) 2025-05-07T20:32:35.3015981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3016425Z def test_silu_mul_quant( 2025-05-07T20:32:35.3016665Z self, 2025-05-07T20:32:35.3016859Z T: int, 2025-05-07T20:32:35.3017051Z D: int, 2025-05-07T20:32:35.3017320Z scale_ub: Optional[float], 2025-05-07T20:32:35.3017587Z contiguous: bool, 2025-05-07T20:32:35.3017830Z compiled: bool, 2025-05-07T20:32:35.3018089Z ) -> None: 2025-05-07T20:32:35.3018302Z torch.manual_seed(2025) 2025-05-07T20:32:35.3018542Z 2025-05-07T20:32:35.3018815Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3020950Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3022796Z 2025-05-07T20:32:35.3022914Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3023128Z 2025-05-07T20:32:35.3023235Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3023649Z self=, 2025-05-07T20:32:35.3024043Z T=16384, 2025-05-07T20:32:35.3024227Z D=7168, 2025-05-07T20:32:35.3024417Z scale_ub=None, 2025-05-07T20:32:35.3024639Z contiguous=True, 2025-05-07T20:32:35.3024929Z compiled=False, 2025-05-07T20:32:35.3025137Z ) 2025-05-07T20:32:35.3025448Z self = 2025-05-07T20:32:35.3025936Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3026215Z 2025-05-07T20:32:35.3026295Z @given( 2025-05-07T20:32:35.3026522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3026831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3027132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3027464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3027804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3028083Z ) 2025-05-07T20:32:35.3028434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3028866Z def test_silu_mul_quant( 2025-05-07T20:32:35.3029101Z self, 2025-05-07T20:32:35.3029304Z T: int, 2025-05-07T20:32:35.3029509Z D: int, 2025-05-07T20:32:35.3029725Z scale_ub: Optional[float], 2025-05-07T20:32:35.3030000Z contiguous: bool, 2025-05-07T20:32:35.3030244Z compiled: bool, 2025-05-07T20:32:35.3030459Z ) -> None: 2025-05-07T20:32:35.3030685Z torch.manual_seed(2025) 2025-05-07T20:32:35.3030921Z 2025-05-07T20:32:35.3031189Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3033235Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3035120Z 2025-05-07T20:32:35.3035248Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3035457Z 2025-05-07T20:32:35.3035562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3035978Z self=, 2025-05-07T20:32:35.3036380Z T=16384, 2025-05-07T20:32:35.3036568Z D=7168, 2025-05-07T20:32:35.3036763Z scale_ub=1200.0, 2025-05-07T20:32:35.3036984Z contiguous=True, 2025-05-07T20:32:35.3037245Z compiled=False, 2025-05-07T20:32:35.3037445Z ) 2025-05-07T20:32:35.3037793Z self = 2025-05-07T20:32:35.3038285Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.3038562Z 2025-05-07T20:32:35.3038636Z @given( 2025-05-07T20:32:35.3038862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3039173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3039470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3039798Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3040122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3040400Z ) 2025-05-07T20:32:35.3040745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3041231Z def test_silu_mul_quant( 2025-05-07T20:32:35.3041468Z self, 2025-05-07T20:32:35.3041659Z T: int, 2025-05-07T20:32:35.3041853Z D: int, 2025-05-07T20:32:35.3042075Z scale_ub: Optional[float], 2025-05-07T20:32:35.3042343Z contiguous: bool, 2025-05-07T20:32:35.3042581Z compiled: bool, 2025-05-07T20:32:35.3042803Z ) -> None: 2025-05-07T20:32:35.3043013Z torch.manual_seed(2025) 2025-05-07T20:32:35.3043257Z 2025-05-07T20:32:35.3043528Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3045590Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
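The requested sizes line up with the cost of a [T, 2 * D] bfloat16 tensor: bfloat16 is 2 bytes per element, so the test's first allocation needs T * (2 * D) * 2 bytes. A minimal sketch (illustrative only, not part of the test suite) that reproduces the larger figures above from the sampled (T, D) grid:

    # Size of x = torch.randn([T, 2 * D], dtype=torch.bfloat16), in MiB.
    for T in (1, 128, 2048, 4096, 16384):
        for D in (5120, 7168):
            mib = T * (2 * D) * 2 / (1 << 20)
            print(f"T={T:6d} D={D}: {mib:8.2f} MiB")
    # T=4096, D=7168 -> 112.00 MiB; T=16384, D=7168 -> 448.00 MiB;
    # T=2048, D=5120 -> 40.00 MiB, matching the "Tried to allocate" sizes.
    # (The 20.00 MiB requests seen later for T=128 do not map one-to-one
    # onto this formula; those presumably include other allocations.)

Even the largest single request here is 448 MiB on a card with roughly 22 GiB of capacity, so the out-of-memory errors point at accumulated allocations rather than at any one oversized example.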
2025-05-07T20:32:35.3047880Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
This example got past the input allocations, and the failure moved into the op under test: y_fp8, y_scale = fn() (moe/activation_test.py:117) called silu_mul_quant(x0, x1, scale_ub_tensor) (moe/activation_test.py:115), which launched the Triton kernel:
2025-05-07T20:32:35.4474940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.4475633Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.4476159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.4476833Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.4477492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.4478018Z     kernel = self.compile(
2025-05-07T20:32:35.4478560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.4479212Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.4486694Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.4487041Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.4487339Z E   ^
2025-05-07T20:32:35.4487793Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.4488645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.4489254Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 56.00 MiB with 26.44 MiB free (21.74 GiB now allocated by PyTorch, 10.99 MiB reserved but unallocated), followed by the same PYTORCH_CUDA_ALLOC_CONF advice.
2025-05-07T20:32:35.4501598Z moe/activation_test.py:92: OutOfMemoryError
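This is the run's second failure mode, and it is environment-specific rather than memory-related: the ~22 GiB card in this job is consistent with an A10G-class GPU at compute capability 8.6, and in this Triton version the fp8e4nv (e4m3) dtype these kernels use needs a newer architecture (roughly sm_89 and up), leaving only fp8e4b15 and fp8e5 available here. A hedged sketch of a capability guard (the helper name and the 8.9 threshold are assumptions, not something the test suite does today):

    # Hypothetical guard: skip fp8 kernel tests on GPUs whose Triton
    # backend cannot compile fp8e4nv (e4m3). Threshold assumed to be sm_89.
    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns (major, minor), e.g. (8, 6) here.
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(
        _supports_fp8e4nv(),
        "Triton fp8e4nv requires compute capability >= 8.9 on this toolchain",
    )
    class ActivationFp8Tests(unittest.TestCase):
        ...

A class-level skipUnless of this shape would turn these hard CompilationErrors into skips on 8.6-class runners while leaving newer GPUs unaffected.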
2025-05-07T20:32:35.4501926Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
The compiled variant of the same call hit the identical Triton error, with torch.compile adding one frame: fn() entered /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678 (in _fn, return fn(*args, **kwargs)) before reaching silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 and the _fbgemm_silu_mul_quant[grid] launch:
2025-05-07T20:32:35.4955685Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.4956042Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.4956300Z E   ^
2025-05-07T20:32:35.4956759Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.4957671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.4958290Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError, this time at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 20.00 MiB while GPU 0 had only 4.44 MiB free, with 21.77 GiB allocated by PyTorch and 6.37 MiB reserved but unallocated, followed by the same allocator advice.
2025-05-07T20:32:35.4971123Z moe/activation_test.py:95: OutOfMemoryError
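Two details in this stretch of the log suggest state leaking across examples: free memory fell from 26.44 MiB at the start of the run to 4.44 MiB here, and the OOM moved from the first allocation (line 92) to a later statement (line 95), so tensors from earlier examples are evidently still resident. A hedged mitigation sketch (the helper is hypothetical, not something the test currently does):

    # Hypothetical cleanup between Hypothesis examples on a ~22 GiB card.
    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver

Called at the top of test_silu_mul_quant, something like this would bound the footprint to one example's tensors; alternatively, exporting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the job environment is exactly what the allocator message itself suggests for fragmentation.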
2025-05-07T20:32:35.4971441Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 20.00 MiB with 4.44 MiB free (21.77 GiB allocated by PyTorch, 3.87 MiB reserved but unallocated), followed by the same allocator advice.
2025-05-07T20:32:35.4984125Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:35.4984442Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
This example failed at the first allocation again:
2025-05-07T20:32:35.6987568Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.6989618Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.6991778Z 2025-05-07T20:32:35.6991993Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.6992211Z 2025-05-07T20:32:35.7002894Z FAILED 2025-05-07T20:32:35.7003177Z 2025-05-07T20:32:35.7003508Z =================================== FAILURES =================================== 2025-05-07T20:32:35.7004152Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:35.7004763Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:35.7005647Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:35.7006524Z | yield 2025-05-07T20:32:35.7007136Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:35.7007881Z | self._callTestMethod(testMethod) 2025-05-07T20:32:35.7008660Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:35.7009406Z | method() 2025-05-07T20:32:35.7010080Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:35.7010802Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7011560Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:35.7012347Z | raise the_error_hypothesis_found 2025-05-07T20:32:35.7013033Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:35.7013753Z +-+---------------- 1 ---------------- 2025-05-07T20:32:35.7014167Z | Traceback (most recent call last): 2025-05-07T20:32:35.7015144Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.7016436Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7019294Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.7022465Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.7023093Z | self=, 2025-05-07T20:32:35.7023648Z | T=2048, 2025-05-07T20:32:35.7023971Z | D=5120, # or any other generated value 2025-05-07T20:32:35.7024451Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.7024937Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.7025434Z | compiled=False, # or any other generated value 2025-05-07T20:32:35.7025842Z | ) 2025-05-07T20:32:35.7026090Z | 2025-05-07T20:32:35.7026815Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:35.7027641Z +---------------- 2 ---------------- 2025-05-07T20:32:35.7028049Z | Traceback (most recent call last): 2025-05-07T20:32:35.7029023Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.7030087Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7032923Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.7035664Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.7036300Z | self=, 2025-05-07T20:32:35.7036854Z | T=128, 2025-05-07T20:32:35.7037141Z | D=7168, 2025-05-07T20:32:35.7037440Z | scale_ub=None, 2025-05-07T20:32:35.7037771Z | contiguous=True, 2025-05-07T20:32:35.7038106Z | compiled=True, 2025-05-07T20:32:35.7038408Z | ) 2025-05-07T20:32:35.7038655Z | 2025-05-07T20:32:35.7039375Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.7040193Z +---------------- 3 ---------------- 2025-05-07T20:32:35.7040587Z | Traceback (most recent call last): 2025-05-07T20:32:35.7041464Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.7042235Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7044250Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.7046252Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.7046680Z | self=, 2025-05-07T20:32:35.7047080Z | T=128, 2025-05-07T20:32:35.7047274Z | D=5120, 2025-05-07T20:32:35.7047481Z | scale_ub=1200.0, 2025-05-07T20:32:35.7047717Z | contiguous=True, 2025-05-07T20:32:35.7048004Z | compiled=True, 2025-05-07T20:32:35.7048218Z | ) 2025-05-07T20:32:35.7048392Z | 2025-05-07T20:32:35.7048954Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.7049557Z +---------------- 4 ---------------- 2025-05-07T20:32:35.7049840Z | Traceback (most recent call last): 2025-05-07T20:32:35.7050544Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:35.7051301Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7051945Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:35.7052630Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7053653Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:35.7054799Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7055659Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:35.7056711Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7057921Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:35.7059059Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7065346Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:35.7066508Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7067596Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:35.7068553Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7069456Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:35.7070247Z | fn() 2025-05-07T20:32:35.7071034Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:35.7071896Z | self.fn.run( 2025-05-07T20:32:35.7072628Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:35.7073420Z | kernel = self.compile( 2025-05-07T20:32:35.7074253Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:35.7075231Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7076180Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.7077281Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7078126Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7078613Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7078978Z | ^ 2025-05-07T20:32:35.7079602Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7080384Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.7080945Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:35.7081761Z | self=, 2025-05-07T20:32:35.7082405Z | T=1, # or any other generated value 2025-05-07T20:32:35.7082841Z | D=5120, # or any other generated value 2025-05-07T20:32:35.7083308Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.7083806Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.7084320Z | compiled=True, # or any other generated value 2025-05-07T20:32:35.7084732Z | ) 2025-05-07T20:32:35.7084985Z | 2025-05-07T20:32:35.7085696Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.7086516Z +------------------------------------ 2025-05-07T20:32:35.7086997Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:35.7087503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7088052Z self=, 2025-05-07T20:32:35.7088585Z T=1, 2025-05-07T20:32:35.7088834Z D=5120, 2025-05-07T20:32:35.7089099Z scale_ub=None, 2025-05-07T20:32:35.7089396Z contiguous=True, 2025-05-07T20:32:35.7089687Z compiled=True, 2025-05-07T20:32:35.7090270Z ) 2025-05-07T20:32:35.7090707Z self = 2025-05-07T20:32:35.7091524Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7091885Z 2025-05-07T20:32:35.7091993Z @given( 2025-05-07T20:32:35.7092310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7092730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7093136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7093576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7094027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7094431Z ) 2025-05-07T20:32:35.7094914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7095529Z def test_silu_mul_quant( 2025-05-07T20:32:35.7095865Z self, 2025-05-07T20:32:35.7096577Z T: int, 2025-05-07T20:32:35.7096847Z D: int, 2025-05-07T20:32:35.7097138Z scale_ub: Optional[float], 2025-05-07T20:32:35.7097513Z contiguous: bool, 2025-05-07T20:32:35.7097867Z compiled: bool, 2025-05-07T20:32:35.7098189Z ) -> None: 2025-05-07T20:32:35.7098488Z torch.manual_seed(2025) 2025-05-07T20:32:35.7098836Z 2025-05-07T20:32:35.7099214Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7099681Z 2025-05-07T20:32:35.7100082Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7100489Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7100921Z x = x_sign * x_clamp 2025-05-07T20:32:35.7101259Z x0 = x[:, :D] 2025-05-07T20:32:35.7101565Z x1 = x[:, D:] 2025-05-07T20:32:35.7101847Z 2025-05-07T20:32:35.7102104Z if contiguous: 2025-05-07T20:32:35.7102425Z x0 = x0.contiguous() 
2025-05-07T20:32:35.7102770Z x1 = x1.contiguous() 2025-05-07T20:32:35.7103096Z 2025-05-07T20:32:35.7103358Z if scale_ub is not None: 2025-05-07T20:32:35.7103732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7104283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7104707Z ) 2025-05-07T20:32:35.7104976Z else: 2025-05-07T20:32:35.7105266Z scale_ub_tensor = None 2025-05-07T20:32:35.7105616Z 2025-05-07T20:32:35.7105936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7106362Z op = silu_mul_quant 2025-05-07T20:32:35.7106711Z if compiled: 2025-05-07T20:32:35.7107689Z op = torch.compile(op) 2025-05-07T20:32:35.7108100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7108576Z 2025-05-07T20:32:35.7108836Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7109286Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7109680Z 2025-05-07T20:32:35.7109999Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7110427Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7110813Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7111274Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7111732Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7112128Z 2025-05-07T20:32:35.7112388Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7112641Z 2025-05-07T20:32:35.7112775Z moe/activation_test.py:126: 2025-05-07T20:32:35.7113155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7113588Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7114017Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7135386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7136451Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7137182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7138179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7139096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7140261Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7141284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7142324Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7143338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7144221Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7145028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7145741Z fn() 2025-05-07T20:32:35.7146458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7147251Z self.fn.run( 2025-05-07T20:32:35.7147932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7148664Z kernel = self.compile( 2025-05-07T20:32:35.7149410Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7150297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7150845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7151152Z 2025-05-07T20:32:35.7151439Z self = 2025-05-07T20:32:35.7152882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7154782Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09d57caf0>} 2025-05-07T20:32:35.7156594Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7158163Z context = 2025-05-07T20:32:35.7158555Z 2025-05-07T20:32:35.7158787Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7159487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7160105Z module_map=module_map) 2025-05-07T20:32:35.7160579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7161039Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7161379Z E ^ 2025-05-07T20:32:35.7161988Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7162589Z 2025-05-07T20:32:35.7163158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7163835Z 2025-05-07T20:32:35.7163977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7164505Z self=, 2025-05-07T20:32:35.7165025Z T=2048, 2025-05-07T20:32:35.7165269Z D=5120, 2025-05-07T20:32:35.7165509Z scale_ub=1200.0, 2025-05-07T20:32:35.7165797Z contiguous=True, 2025-05-07T20:32:35.7166080Z compiled=False, 2025-05-07T20:32:35.7166357Z ) 2025-05-07T20:32:35.7166831Z self = 2025-05-07T20:32:35.7167468Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.7167810Z 2025-05-07T20:32:35.7167913Z @given( 2025-05-07T20:32:35.7168202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7168603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7169003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7169422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7169857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7170236Z ) 2025-05-07T20:32:35.7170682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7171251Z def test_silu_mul_quant( 2025-05-07T20:32:35.7171564Z self, 2025-05-07T20:32:35.7171804Z T: int, 2025-05-07T20:32:35.7172065Z D: int, 2025-05-07T20:32:35.7172351Z scale_ub: Optional[float], 2025-05-07T20:32:35.7172701Z contiguous: bool, 2025-05-07T20:32:35.7173013Z compiled: bool, 2025-05-07T20:32:35.7173323Z ) -> None: 2025-05-07T20:32:35.7173626Z torch.manual_seed(2025) 2025-05-07T20:32:35.7173958Z 2025-05-07T20:32:35.7174308Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7174750Z 2025-05-07T20:32:35.7174989Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7175370Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7175772Z x = x_sign * x_clamp 2025-05-07T20:32:35.7176076Z x0 = x[:, :D] 
2025-05-07T20:32:35.7176367Z x1 = x[:, D:] 2025-05-07T20:32:35.7176644Z 2025-05-07T20:32:35.7176878Z if contiguous: 2025-05-07T20:32:35.7177182Z x0 = x0.contiguous() 2025-05-07T20:32:35.7177513Z x1 = x1.contiguous() 2025-05-07T20:32:35.7177830Z 2025-05-07T20:32:35.7178138Z if scale_ub is not None: 2025-05-07T20:32:35.7178486Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7178916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7179310Z ) 2025-05-07T20:32:35.7179552Z else: 2025-05-07T20:32:35.7179967Z scale_ub_tensor = None 2025-05-07T20:32:35.7180302Z 2025-05-07T20:32:35.7180595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7181028Z op = silu_mul_quant 2025-05-07T20:32:35.7181378Z if compiled: 2025-05-07T20:32:35.7181751Z op = torch.compile(op) 2025-05-07T20:32:35.7182137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7182538Z 2025-05-07T20:32:35.7182789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7182999Z 2025-05-07T20:32:35.7183125Z moe/activation_test.py:117: 2025-05-07T20:32:35.7183516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7183952Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7184309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7185214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7186144Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7186865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7187777Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7188682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7189401Z kernel = self.compile( 2025-05-07T20:32:35.7190480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7191386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7192079Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7192396Z 2025-05-07T20:32:35.7192682Z self = 2025-05-07T20:32:35.7194122Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7195988Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09d45d990>} 2025-05-07T20:32:35.7197867Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7199266Z context = 2025-05-07T20:32:35.7199647Z 2025-05-07T20:32:35.7199866Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7200564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7201198Z module_map=module_map) 2025-05-07T20:32:35.7201681Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7202142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7202491Z E ^ 2025-05-07T20:32:35.7203103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7203717Z 2025-05-07T20:32:35.7204308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7205037Z 2025-05-07T20:32:35.7205179Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7205839Z self=, 2025-05-07T20:32:35.7206402Z T=2048, 2025-05-07T20:32:35.7206655Z D=5120, 2025-05-07T20:32:35.7206918Z scale_ub=1200.0, 2025-05-07T20:32:35.7207232Z contiguous=True, 2025-05-07T20:32:35.7207534Z compiled=True, 2025-05-07T20:32:35.7207813Z ) 2025-05-07T20:32:35.7208245Z self = 2025-05-07T20:32:35.7208917Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.7209386Z 2025-05-07T20:32:35.7209491Z @given( 2025-05-07T20:32:35.7209805Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7210302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7210725Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7211217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7211680Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7212080Z ) 2025-05-07T20:32:35.7212562Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7213160Z def test_silu_mul_quant( 2025-05-07T20:32:35.7213501Z self, 2025-05-07T20:32:35.7213753Z T: int, 2025-05-07T20:32:35.7214017Z D: int, 2025-05-07T20:32:35.7214313Z scale_ub: Optional[float], 2025-05-07T20:32:35.7214675Z contiguous: bool, 2025-05-07T20:32:35.7214998Z compiled: bool, 2025-05-07T20:32:35.7215297Z ) -> None: 2025-05-07T20:32:35.7215601Z torch.manual_seed(2025) 2025-05-07T20:32:35.7215915Z 2025-05-07T20:32:35.7216270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7216697Z 2025-05-07T20:32:35.7216955Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7217315Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7217733Z x = x_sign * x_clamp 2025-05-07T20:32:35.7218082Z x0 = x[:, :D] 2025-05-07T20:32:35.7218450Z x1 = x[:, D:] 2025-05-07T20:32:35.7218740Z 2025-05-07T20:32:35.7218996Z if contiguous: 2025-05-07T20:32:35.7219318Z x0 = x0.contiguous() 2025-05-07T20:32:35.7219670Z x1 = x1.contiguous() 2025-05-07T20:32:35.7220149Z 2025-05-07T20:32:35.7220426Z if scale_ub is not None: 2025-05-07T20:32:35.7220800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7221158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7221472Z ) 2025-05-07T20:32:35.7221669Z else: 2025-05-07T20:32:35.7221877Z scale_ub_tensor = None 2025-05-07T20:32:35.7222135Z 2025-05-07T20:32:35.7222374Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7222684Z op = silu_mul_quant 2025-05-07T20:32:35.7222940Z if compiled: 2025-05-07T20:32:35.7223193Z op = torch.compile(op) 2025-05-07T20:32:35.7223490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7223768Z 2025-05-07T20:32:35.7223967Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7224255Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7224547Z 2025-05-07T20:32:35.7224787Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7225116Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7225413Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7225730Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7226096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7226406Z 2025-05-07T20:32:35.7226618Z > 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd097e2d3f0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
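The ref_fn path above spells out the numerics under test: upcast both halves to fp32, apply SiLU gating (x0 * sigmoid(x0) * x1), then quantize each row to FP8 with one scale per row, which the test undoes via y_fp8.to(torch.float32) * y_scale[:, None]. For readers without the FBGEMM/Triton kernels, a plain-PyTorch approximation of that contract is sketched below. The E4M3 maximum of 448 and the reading of scale_ub as a cap on the per-row maximum are assumptions about triton_quantize_fp8_row's semantics, not verified behavior.

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so that y / scale fits the E4M3 range.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed outlier cap
        scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0, x1 = torch.randn(2, 16, 128).unbind(0)
    y = x0 * torch.sigmoid(x0) * x1              # SiLU(x0) * x1
    y_fp8, y_scale = rowwise_quantize_fp8(y)
    y_round_trip = y_fp8.to(torch.float32) * y_scale[:, None]  # as in the test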
2025-05-07T20:32:35.7264024Z op = torch.compile(op) 2025-05-07T20:32:35.7264316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7264593Z 2025-05-07T20:32:35.7264787Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7264950Z 2025-05-07T20:32:35.7265051Z moe/activation_test.py:117: 2025-05-07T20:32:35.7265345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7265678Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7265956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7266644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7267334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7267870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7268542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7269262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7269799Z kernel = self.compile( 2025-05-07T20:32:35.7270336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7270994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7271425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7271694Z 2025-05-07T20:32:35.7271948Z self = 2025-05-07T20:32:35.7273019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7274402Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097e2ce50>} 2025-05-07T20:32:35.7275733Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7276748Z context = 2025-05-07T20:32:35.7277038Z 2025-05-07T20:32:35.7277215Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7277730Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7278196Z module_map=module_map) 2025-05-07T20:32:35.7278561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7278909Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7279169Z E ^ 2025-05-07T20:32:35.7279673Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7280128Z 2025-05-07T20:32:35.7280545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7281051Z 2025-05-07T20:32:35.7281157Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7281566Z self=, 2025-05-07T20:32:35.7281967Z T=1, 2025-05-07T20:32:35.7282154Z D=7168, 2025-05-07T20:32:35.7282344Z scale_ub=None, 2025-05-07T20:32:35.7282563Z contiguous=True, 2025-05-07T20:32:35.7282787Z compiled=True, 2025-05-07T20:32:35.7282985Z ) 2025-05-07T20:32:35.7283305Z self = 2025-05-07T20:32:35.7283784Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7284042Z 2025-05-07T20:32:35.7284122Z @given( 2025-05-07T20:32:35.7284354Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7284667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7284967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7285293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7285619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7285902Z ) 2025-05-07T20:32:35.7286245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7286684Z def test_silu_mul_quant( 2025-05-07T20:32:35.7286924Z self, 2025-05-07T20:32:35.7287114Z T: int, 2025-05-07T20:32:35.7287311Z D: int, 2025-05-07T20:32:35.7287531Z scale_ub: Optional[float], 2025-05-07T20:32:35.7287795Z contiguous: bool, 2025-05-07T20:32:35.7288038Z compiled: bool, 2025-05-07T20:32:35.7288311Z ) -> None: 2025-05-07T20:32:35.7288526Z torch.manual_seed(2025) 2025-05-07T20:32:35.7288769Z 2025-05-07T20:32:35.7289039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7289371Z 2025-05-07T20:32:35.7289564Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7290100Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7290473Z x = x_sign * x_clamp 2025-05-07T20:32:35.7290715Z x0 = x[:, :D] 2025-05-07T20:32:35.7290932Z x1 = x[:, D:] 2025-05-07T20:32:35.7291135Z 2025-05-07T20:32:35.7291444Z if contiguous: 2025-05-07T20:32:35.7291684Z x0 = x0.contiguous() 2025-05-07T20:32:35.7292001Z x1 = x1.contiguous() 2025-05-07T20:32:35.7292245Z 2025-05-07T20:32:35.7292439Z if scale_ub is not None: 2025-05-07T20:32:35.7292715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7293043Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7293361Z ) 2025-05-07T20:32:35.7293558Z else: 2025-05-07T20:32:35.7293769Z scale_ub_tensor = None 2025-05-07T20:32:35.7294021Z 2025-05-07T20:32:35.7294254Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7294559Z op = silu_mul_quant 2025-05-07T20:32:35.7294811Z if compiled: 2025-05-07T20:32:35.7295060Z op = torch.compile(op) 2025-05-07T20:32:35.7295348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7295620Z 2025-05-07T20:32:35.7295821Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7296101Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7296394Z 2025-05-07T20:32:35.7296638Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7296976Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7297264Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7297576Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7298004Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7298310Z 2025-05-07T20:32:35.7298515Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7298707Z 2025-05-07T20:32:35.7298814Z moe/activation_test.py:126: 2025-05-07T20:32:35.7299111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7299443Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7299877Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7300676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7301420Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7301962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7302641Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7303325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7304035Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7304783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7305524Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7306242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7306879Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7307474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7307989Z fn() 2025-05-07T20:32:35.7308596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7309183Z self.fn.run( 2025-05-07T20:32:35.7309649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7310174Z kernel = self.compile( 2025-05-07T20:32:35.7310712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7311363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7311801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7312064Z 2025-05-07T20:32:35.7319541Z self = 2025-05-07T20:32:35.7320660Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7322033Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd097bc5000>} 2025-05-07T20:32:35.7323382Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7324403Z context = 2025-05-07T20:32:35.7324690Z 2025-05-07T20:32:35.7324865Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7325386Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7325857Z module_map=module_map) 2025-05-07T20:32:35.7326307Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7326666Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7326936Z E ^ 2025-05-07T20:32:35.7327404Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7327847Z 2025-05-07T20:32:35.7328268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7328774Z 2025-05-07T20:32:35.7328878Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7329297Z self=, 2025-05-07T20:32:35.7329696Z T=4096, 2025-05-07T20:32:35.7329886Z D=5120, 2025-05-07T20:32:35.7330081Z scale_ub=None, 2025-05-07T20:32:35.7330300Z contiguous=False, 2025-05-07T20:32:35.7330528Z compiled=False, 2025-05-07T20:32:35.7330734Z ) 2025-05-07T20:32:35.7331084Z self = 2025-05-07T20:32:35.7331609Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7331881Z 2025-05-07T20:32:35.7331961Z @given( 2025-05-07T20:32:35.7332197Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7332517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7332820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7333150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7333482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7333762Z ) 2025-05-07T20:32:35.7334118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7334559Z def test_silu_mul_quant( 2025-05-07T20:32:35.7334804Z self, 2025-05-07T20:32:35.7334998Z T: int, 2025-05-07T20:32:35.7335202Z D: int, 2025-05-07T20:32:35.7335427Z scale_ub: Optional[float], 2025-05-07T20:32:35.7335752Z contiguous: bool, 2025-05-07T20:32:35.7336000Z compiled: bool, 2025-05-07T20:32:35.7336235Z ) -> None: 2025-05-07T20:32:35.7336454Z torch.manual_seed(2025) 2025-05-07T20:32:35.7336703Z 2025-05-07T20:32:35.7336982Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7337316Z 2025-05-07T20:32:35.7337514Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7337813Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7338117Z x = x_sign * x_clamp 2025-05-07T20:32:35.7338412Z x0 = x[:, :D] 2025-05-07T20:32:35.7338632Z x1 = x[:, D:] 2025-05-07T20:32:35.7338836Z 2025-05-07T20:32:35.7339068Z if contiguous: 2025-05-07T20:32:35.7339310Z x0 = x0.contiguous() 2025-05-07T20:32:35.7339566Z x1 = x1.contiguous() 2025-05-07T20:32:35.7340547Z 2025-05-07T20:32:35.7340747Z if scale_ub is not None: 2025-05-07T20:32:35.7341024Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7341361Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7341668Z ) 2025-05-07T20:32:35.7341864Z else: 2025-05-07T20:32:35.7342074Z scale_ub_tensor = None 2025-05-07T20:32:35.7342323Z 2025-05-07T20:32:35.7342557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7342866Z op = silu_mul_quant 2025-05-07T20:32:35.7343120Z if compiled: 
2025-05-07T20:32:35.7343368Z op = torch.compile(op) 2025-05-07T20:32:35.7343673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7343942Z 2025-05-07T20:32:35.7344139Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7344306Z 2025-05-07T20:32:35.7344412Z moe/activation_test.py:117: 2025-05-07T20:32:35.7344702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7345034Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7345319Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7346054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7346744Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7347286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7347961Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7348617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7349157Z kernel = self.compile( 2025-05-07T20:32:35.7349698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7350343Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7350732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7350961Z 2025-05-07T20:32:35.7351164Z self = 2025-05-07T20:32:35.7352224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7353578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097bc5a20>} 2025-05-07T20:32:35.7354922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7355926Z context = 2025-05-07T20:32:35.7356260Z 2025-05-07T20:32:35.7356432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7356948Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7357399Z module_map=module_map) 2025-05-07T20:32:35.7357762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7358108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7358360Z E ^ 2025-05-07T20:32:35.7358812Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7359309Z 2025-05-07T20:32:35.7359754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7360266Z 2025-05-07T20:32:35.7360373Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7360777Z self=, 2025-05-07T20:32:35.7361175Z T=4096, 2025-05-07T20:32:35.7361363Z D=7168, 2025-05-07T20:32:35.7361550Z scale_ub=None, 2025-05-07T20:32:35.7361756Z contiguous=False, 2025-05-07T20:32:35.7361978Z compiled=False, 2025-05-07T20:32:35.7362176Z ) 2025-05-07T20:32:35.7362483Z self = 2025-05-07T20:32:35.7362971Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7363241Z 2025-05-07T20:32:35.7363322Z @given( 2025-05-07T20:32:35.7363549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7363854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7364157Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7364489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7364803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7365081Z ) 2025-05-07T20:32:35.7365466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7365907Z def test_silu_mul_quant( 2025-05-07T20:32:35.7366151Z self, 2025-05-07T20:32:35.7366340Z T: int, 2025-05-07T20:32:35.7366528Z D: int, 2025-05-07T20:32:35.7366739Z scale_ub: Optional[float], 2025-05-07T20:32:35.7367005Z contiguous: bool, 2025-05-07T20:32:35.7367235Z compiled: bool, 2025-05-07T20:32:35.7367449Z ) -> None: 2025-05-07T20:32:35.7367663Z torch.manual_seed(2025) 2025-05-07T20:32:35.7367895Z 2025-05-07T20:32:35.7368164Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7368497Z 2025-05-07T20:32:35.7368684Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7368971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7369272Z x = x_sign * x_clamp 2025-05-07T20:32:35.7369505Z x0 = x[:, :D] 2025-05-07T20:32:35.7369708Z x1 = x[:, D:] 2025-05-07T20:32:35.7369915Z 2025-05-07T20:32:35.7370100Z if contiguous: 2025-05-07T20:32:35.7370325Z x0 = x0.contiguous() 2025-05-07T20:32:35.7370575Z x1 = x1.contiguous() 2025-05-07T20:32:35.7370805Z 2025-05-07T20:32:35.7370987Z if scale_ub is not None: 2025-05-07T20:32:35.7371255Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7371582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7371874Z ) 2025-05-07T20:32:35.7372063Z else: 2025-05-07T20:32:35.7372272Z scale_ub_tensor = None 2025-05-07T20:32:35.7372516Z 2025-05-07T20:32:35.7372742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7373051Z op = silu_mul_quant 2025-05-07T20:32:35.7373290Z if compiled: 2025-05-07T20:32:35.7373532Z op = torch.compile(op) 2025-05-07T20:32:35.7373822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7374143Z 2025-05-07T20:32:35.7374330Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7374499Z 2025-05-07T20:32:35.7374596Z moe/activation_test.py:117: 2025-05-07T20:32:35.7374888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7375211Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7375491Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7376165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7376894Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7377462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7378143Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7378798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7379317Z kernel = self.compile( 2025-05-07T20:32:35.7379978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7380632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7381024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7381246Z 2025-05-07T20:32:35.7381446Z self = 2025-05-07T20:32:35.7382505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7383875Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097bc6560>} 2025-05-07T20:32:35.7385268Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7386279Z context = 2025-05-07T20:32:35.7386565Z 2025-05-07T20:32:35.7386729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7387239Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7387696Z module_map=module_map) 2025-05-07T20:32:35.7388053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7388406Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7388658Z E ^ 2025-05-07T20:32:35.7389109Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7389558Z 2025-05-07T20:32:35.7390252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7390789Z 2025-05-07T20:32:35.7390894Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7391299Z self=, 2025-05-07T20:32:35.7391686Z T=128, 2025-05-07T20:32:35.7391869Z D=7168, 2025-05-07T20:32:35.7392060Z scale_ub=None, 2025-05-07T20:32:35.7392271Z contiguous=False, 2025-05-07T20:32:35.7392491Z compiled=True, 2025-05-07T20:32:35.7392697Z ) 2025-05-07T20:32:35.7393006Z self = 2025-05-07T20:32:35.7393496Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7393764Z 2025-05-07T20:32:35.7393841Z @given( 2025-05-07T20:32:35.7394069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7394378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7394783Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7395112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7395436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7395708Z ) 2025-05-07T20:32:35.7396055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7396492Z def test_silu_mul_quant( 2025-05-07T20:32:35.7396724Z self, 2025-05-07T20:32:35.7396913Z T: int, 2025-05-07T20:32:35.7397176Z D: int, 2025-05-07T20:32:35.7397388Z scale_ub: Optional[float], 2025-05-07T20:32:35.7397659Z contiguous: bool, 2025-05-07T20:32:35.7397948Z compiled: bool, 2025-05-07T20:32:35.7398167Z ) -> None: 2025-05-07T20:32:35.7398382Z torch.manual_seed(2025) 2025-05-07T20:32:35.7398620Z 2025-05-07T20:32:35.7398881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7399218Z 2025-05-07T20:32:35.7399411Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7399698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7399998Z x = x_sign * x_clamp 2025-05-07T20:32:35.7400237Z x0 = x[:, :D] 2025-05-07T20:32:35.7400450Z x1 = x[:, D:] 2025-05-07T20:32:35.7400646Z 2025-05-07T20:32:35.7400828Z if contiguous: 2025-05-07T20:32:35.7401058Z x0 = x0.contiguous() 2025-05-07T20:32:35.7401308Z x1 = x1.contiguous() 2025-05-07T20:32:35.7401542Z 2025-05-07T20:32:35.7401735Z if scale_ub is not None: 2025-05-07T20:32:35.7401999Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7402329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7402627Z ) 2025-05-07T20:32:35.7402809Z else: 2025-05-07T20:32:35.7403019Z scale_ub_tensor = None 2025-05-07T20:32:35.7403265Z 2025-05-07T20:32:35.7403557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7403869Z op = silu_mul_quant 2025-05-07T20:32:35.7404117Z if compiled: 2025-05-07T20:32:35.7404362Z op = torch.compile(op) 2025-05-07T20:32:35.7404649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7404916Z 2025-05-07T20:32:35.7405109Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7405384Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7405667Z 2025-05-07T20:32:35.7405904Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7406232Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7406526Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7406840Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7407187Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7407493Z 2025-05-07T20:32:35.7407693Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7407887Z 2025-05-07T20:32:35.7407992Z moe/activation_test.py:126: 2025-05-07T20:32:35.7408280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7408609Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7408931Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7409700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7409807Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7410171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7410393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7410754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7411060Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7411460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7411708Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7412083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7412246Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7412666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7412748Z fn() 2025-05-07T20:32:35.7413143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7413223Z self.fn.run( 2025-05-07T20:32:35.7413568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7413660Z kernel = self.compile( 2025-05-07T20:32:35.7414039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7414212Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7414338Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7414343Z 2025-05-07T20:32:35.7414547Z self = 2025-05-07T20:32:35.7415333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7415869Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd097bca680>} 2025-05-07T20:32:35.7416619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7416809Z context = 2025-05-07T20:32:35.7416813Z 2025-05-07T20:32:35.7416982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7417246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7417362Z module_map=module_map) 2025-05-07T20:32:35.7417522Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7417621Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7417701Z E ^ 2025-05-07T20:32:35.7418058Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7418065Z 2025-05-07T20:32:35.7418479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7418489Z 2025-05-07T20:32:35.7418593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7418812Z self=, 2025-05-07T20:32:35.7418890Z T=128, 2025-05-07T20:32:35.7418965Z D=7168, 2025-05-07T20:32:35.7419050Z scale_ub=None, 2025-05-07T20:32:35.7419141Z contiguous=False, 2025-05-07T20:32:35.7419222Z compiled=False, 2025-05-07T20:32:35.7419290Z ) 2025-05-07T20:32:35.7419517Z self = 2025-05-07T20:32:35.7419688Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7419693Z 2025-05-07T20:32:35.7419887Z @given( 2025-05-07T20:32:35.7420069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7420166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7420285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7420401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7420515Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7420592Z ) 2025-05-07T20:32:35.7420834Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7420926Z def test_silu_mul_quant( 2025-05-07T20:32:35.7421048Z self, 2025-05-07T20:32:35.7421125Z T: int, 2025-05-07T20:32:35.7421200Z D: int, 2025-05-07T20:32:35.7421339Z scale_ub: Optional[float], 2025-05-07T20:32:35.7421429Z contiguous: bool, 2025-05-07T20:32:35.7421519Z compiled: bool, 2025-05-07T20:32:35.7421595Z ) -> None: 2025-05-07T20:32:35.7421687Z torch.manual_seed(2025) 2025-05-07T20:32:35.7421760Z 2025-05-07T20:32:35.7421927Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7421997Z 2025-05-07T20:32:35.7422093Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7422216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7422300Z x = x_sign * x_clamp 2025-05-07T20:32:35.7422384Z x0 = x[:, :D] 2025-05-07T20:32:35.7422461Z x1 = x[:, D:] 2025-05-07T20:32:35.7422531Z 2025-05-07T20:32:35.7422619Z if contiguous: 2025-05-07T20:32:35.7422710Z x0 = x0.contiguous() 2025-05-07T20:32:35.7422806Z x1 = x1.contiguous() 2025-05-07T20:32:35.7422877Z 2025-05-07T20:32:35.7422968Z if scale_ub is not None: 2025-05-07T20:32:35.7423080Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7423215Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7423288Z ) 2025-05-07T20:32:35.7423364Z else: 2025-05-07T20:32:35.7423506Z scale_ub_tensor = None 2025-05-07T20:32:35.7423581Z 2025-05-07T20:32:35.7423712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7423804Z op = silu_mul_quant 2025-05-07T20:32:35.7423887Z if compiled: 
2025-05-07T20:32:35.7423993Z op = torch.compile(op) 2025-05-07T20:32:35.7424098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7424170Z 2025-05-07T20:32:35.7424264Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7424268Z 2025-05-07T20:32:35.7424364Z moe/activation_test.py:117: 2025-05-07T20:32:35.7424499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7424602Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7424699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7425197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7425297Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7425659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7425877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7426211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7426308Z kernel = self.compile( 2025-05-07T20:32:35.7426684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7426860Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7426991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7426996Z 2025-05-07T20:32:35.7427197Z self = 2025-05-07T20:32:35.7427979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7428521Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097c25f30>} 2025-05-07T20:32:35.7429271Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7429558Z context = 2025-05-07T20:32:35.7429563Z 2025-05-07T20:32:35.7429729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7429991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7430103Z module_map=module_map) 2025-05-07T20:32:35.7430262Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7430365Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7430437Z E ^ 2025-05-07T20:32:35.7430790Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7430795Z 2025-05-07T20:32:35.7431203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7431211Z 2025-05-07T20:32:35.7431312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7431538Z self=, 2025-05-07T20:32:35.7431610Z T=4096, 2025-05-07T20:32:35.7431684Z D=5120, 2025-05-07T20:32:35.7431764Z scale_ub=1200.0, 2025-05-07T20:32:35.7431844Z contiguous=True, 2025-05-07T20:32:35.7431931Z compiled=False, 2025-05-07T20:32:35.7432043Z ) 2025-05-07T20:32:35.7432259Z self = 2025-05-07T20:32:35.7432435Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.7432439Z 2025-05-07T20:32:35.7432515Z @given( 2025-05-07T20:32:35.7432630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7432734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7432849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7432972Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7433087Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7433160Z ) 2025-05-07T20:32:35.7433412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7433504Z def test_silu_mul_quant( 2025-05-07T20:32:35.7433576Z self, 2025-05-07T20:32:35.7433660Z T: int, 2025-05-07T20:32:35.7433739Z D: int, 2025-05-07T20:32:35.7433838Z scale_ub: Optional[float], 2025-05-07T20:32:35.7433934Z contiguous: bool, 2025-05-07T20:32:35.7434020Z compiled: bool, 2025-05-07T20:32:35.7434097Z ) -> None: 2025-05-07T20:32:35.7434193Z torch.manual_seed(2025) 2025-05-07T20:32:35.7434260Z 2025-05-07T20:32:35.7434432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7434504Z 2025-05-07T20:32:35.7434594Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7434723Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7434810Z x = x_sign * x_clamp 2025-05-07T20:32:35.7434889Z x0 = x[:, :D] 2025-05-07T20:32:35.7434971Z x1 = x[:, D:] 2025-05-07T20:32:35.7435044Z 2025-05-07T20:32:35.7435127Z if contiguous: 2025-05-07T20:32:35.7435225Z x0 = x0.contiguous() 2025-05-07T20:32:35.7435311Z x1 = x1.contiguous() 2025-05-07T20:32:35.7435424Z 2025-05-07T20:32:35.7435524Z if scale_ub is not None: 2025-05-07T20:32:35.7435628Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7435763Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7435840Z ) 2025-05-07T20:32:35.7435913Z else: 2025-05-07T20:32:35.7436012Z scale_ub_tensor = None 2025-05-07T20:32:35.7436081Z 2025-05-07T20:32:35.7436213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7436312Z op = silu_mul_quant 2025-05-07T20:32:35.7436439Z if compiled: 2025-05-07T20:32:35.7436538Z op = torch.compile(op) 2025-05-07T20:32:35.7436686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7436756Z 2025-05-07T20:32:35.7436847Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7436851Z 2025-05-07T20:32:35.7436955Z moe/activation_test.py:117: 2025-05-07T20:32:35.7437082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7437190Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7437289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7437784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7437887Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7438247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7438469Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7438822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7438915Z kernel = self.compile( 2025-05-07T20:32:35.7439300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7439517Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7439641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7439645Z 2025-05-07T20:32:35.7439855Z self = 2025-05-07T20:32:35.7440618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7441123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd097c25b40>} 2025-05-07T20:32:35.7441868Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7442058Z context = 2025-05-07T20:32:35.7442068Z 2025-05-07T20:32:35.7442233Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7442493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7442604Z module_map=module_map) 2025-05-07T20:32:35.7442763Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7442860Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7442943Z E ^ 2025-05-07T20:32:35.7443294Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7443298Z 2025-05-07T20:32:35.7443719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7443723Z 2025-05-07T20:32:35.7443871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7444095Z self=, 2025-05-07T20:32:35.7444175Z T=1, 2025-05-07T20:32:35.7444248Z D=5120, 2025-05-07T20:32:35.7444328Z scale_ub=None, 2025-05-07T20:32:35.7444416Z contiguous=True, 2025-05-07T20:32:35.7444495Z compiled=True, 2025-05-07T20:32:35.7444563Z ) 2025-05-07T20:32:35.7444788Z self = 2025-05-07T20:32:35.7444946Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7445023Z 2025-05-07T20:32:35.7445096Z @given( 2025-05-07T20:32:35.7445254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7445351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7445468Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7445581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7445698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7445773Z ) 2025-05-07T20:32:35.7446022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7446118Z def test_silu_mul_quant( 2025-05-07T20:32:35.7446193Z self, 2025-05-07T20:32:35.7446265Z T: int, 2025-05-07T20:32:35.7446340Z D: int, 2025-05-07T20:32:35.7446438Z scale_ub: Optional[float], 2025-05-07T20:32:35.7446527Z contiguous: bool, 2025-05-07T20:32:35.7446614Z compiled: bool, 2025-05-07T20:32:35.7446693Z ) -> None: 2025-05-07T20:32:35.7446788Z torch.manual_seed(2025) 2025-05-07T20:32:35.7446857Z 2025-05-07T20:32:35.7447033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7447103Z 2025-05-07T20:32:35.7447196Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7447317Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7447403Z x = x_sign * x_clamp 2025-05-07T20:32:35.7447529Z x0 = x[:, :D] 2025-05-07T20:32:35.7447608Z x1 = x[:, D:] 2025-05-07T20:32:35.7447683Z 2025-05-07T20:32:35.7447766Z if contiguous: 2025-05-07T20:32:35.7447859Z x0 = x0.contiguous() 2025-05-07T20:32:35.7447951Z x1 = x1.contiguous() 2025-05-07T20:32:35.7448021Z 2025-05-07T20:32:35.7448110Z if scale_ub is not None: 2025-05-07T20:32:35.7448217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7448350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7448427Z ) 2025-05-07T20:32:35.7448504Z else: 2025-05-07T20:32:35.7448598Z scale_ub_tensor = None 2025-05-07T20:32:35.7448670Z 2025-05-07T20:32:35.7448805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7448892Z op = silu_mul_quant 2025-05-07T20:32:35.7448978Z if compiled: 2025-05-07T20:32:35.7449075Z op = torch.compile(op) 2025-05-07T20:32:35.7449187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7449257Z 2025-05-07T20:32:35.7449348Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7449468Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7449540Z 2025-05-07T20:32:35.7449674Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7449777Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7449880Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7450001Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7450141Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7450220Z 2025-05-07T20:32:35.7450321Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7450325Z 2025-05-07T20:32:35.7450425Z moe/activation_test.py:126: 2025-05-07T20:32:35.7450550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7450711Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7450854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7451411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7460896Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7461289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7461515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7462012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7462272Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7462675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7462933Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7463318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7463484Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7463821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7463900Z fn() 2025-05-07T20:32:35.7464298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7464392Z self.fn.run( 2025-05-07T20:32:35.7464731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7464827Z kernel = self.compile( 2025-05-07T20:32:35.7465264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7465443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7465570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7465575Z 2025-05-07T20:32:35.7465790Z self = 2025-05-07T20:32:35.7466563Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7467071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd097c271c0>} 2025-05-07T20:32:35.7467826Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7468020Z context = 2025-05-07T20:32:35.7468025Z 2025-05-07T20:32:35.7468189Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7468451Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7468567Z module_map=module_map) 2025-05-07T20:32:35.7468731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7468836Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7468917Z E ^ 2025-05-07T20:32:35.7469270Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7469275Z 2025-05-07T20:32:35.7469691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7469739Z 2025-05-07T20:32:35.7469846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7470065Z self=, 2025-05-07T20:32:35.7470148Z T=2048, 2025-05-07T20:32:35.7470224Z D=5120, 2025-05-07T20:32:35.7470317Z scale_ub=None, 2025-05-07T20:32:35.7470405Z contiguous=True, 2025-05-07T20:32:35.7470488Z compiled=True, 2025-05-07T20:32:35.7470565Z ) 2025-05-07T20:32:35.7470782Z self = 2025-05-07T20:32:35.7471020Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7471062Z 2025-05-07T20:32:35.7471146Z @given( 2025-05-07T20:32:35.7471266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7471367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7471490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7471616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7471737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7471814Z ) 2025-05-07T20:32:35.7472056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7472155Z def test_silu_mul_quant( 2025-05-07T20:32:35.7472234Z self, 2025-05-07T20:32:35.7472310Z T: int, 2025-05-07T20:32:35.7472392Z D: int, 2025-05-07T20:32:35.7472492Z scale_ub: Optional[float], 2025-05-07T20:32:35.7472586Z contiguous: bool, 2025-05-07T20:32:35.7472675Z compiled: bool, 2025-05-07T20:32:35.7472753Z ) -> None: 2025-05-07T20:32:35.7472856Z torch.manual_seed(2025) 2025-05-07T20:32:35.7472937Z 2025-05-07T20:32:35.7473103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7473178Z 2025-05-07T20:32:35.7473273Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7473445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7473540Z x = x_sign * x_clamp 2025-05-07T20:32:35.7473621Z x0 = x[:, :D] 2025-05-07T20:32:35.7473703Z x1 = x[:, D:] 2025-05-07T20:32:35.7473788Z 2025-05-07T20:32:35.7473871Z if contiguous: 2025-05-07T20:32:35.7473965Z x0 = x0.contiguous() 2025-05-07T20:32:35.7474062Z x1 = x1.contiguous() 2025-05-07T20:32:35.7474132Z 2025-05-07T20:32:35.7474224Z if scale_ub is not None: 2025-05-07T20:32:35.7474337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7474474Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7474550Z ) 2025-05-07T20:32:35.7474634Z else: 2025-05-07T20:32:35.7474727Z scale_ub_tensor = None 2025-05-07T20:32:35.7474803Z 2025-05-07T20:32:35.7474935Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7475027Z op = silu_mul_quant 2025-05-07T20:32:35.7475121Z if compiled: 
2025-05-07T20:32:35.7475226Z                 op = torch.compile(op)
2025-05-07T20:32:35.7475331Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.7475496Z         y_fp8, y_scale = fn()
2025-05-07T20:32:35.7475617Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:35.7475825Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:35.7475927Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:35.7476036Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:35.7476158Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:35.7476304Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:35.7476477Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:35.7476584Z moe/activation_test.py:126:
2025-05-07T20:32:35.7476761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7476866Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:35.7477006Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:35.7477570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:35.7477675Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:35.7478029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.7478335Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.7478713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:35.7478968Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:35.7479381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:35.7479632Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:35.7480002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:35.7480174Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:35.7480511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:35.7480593Z     fn()
2025-05-07T20:32:35.7480997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:35.7481076Z     self.fn.run(
2025-05-07T20:32:35.7481416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.7481553Z     kernel = self.compile(
2025-05-07T20:32:35.7481933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.7482115Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.7482240Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7482449Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:35.7483224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.7483729Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd0977cf9a0>}
2025-05-07T20:32:35.7484471Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:35.7484661Z context = <...>
2025-05-07T20:32:35.7484835Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.7485105Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.7485213Z                            module_map=module_map)
2025-05-07T20:32:35.7485387Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7485489Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:35.7485564Z E       ^
2025-05-07T20:32:35.7485919Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7486386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7486504Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:35.7493287Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:35.7493487Z moe/activation_test.py:126:
2025-05-07T20:32:35.7502307Z E       triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row(
2025-05-07T20:32:35.7502839Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
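Note: every example in this run fails the same way, whether the error surfaces in the eager FBGEMM kernel (_fbgemm_silu_mul_quant, via moe/activation_test.py:117) or in the reference quantization path (_kernel_quantize_fp8_row, via moe/activation_test.py:126). Triton's NVIDIA backend generally supports fp8e4nv (FP8 E4M3) only on GPUs with compute capability 8.9 or newer; older architectures expose only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a guard such a test could use, assuming only that torch is importable (the helper name supports_fp8e4nv is hypothetical, not part of the test file):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) codegen needs an SM 8.9+ GPU (Ada/Hopper);
        # earlier parts only expose fp8e4b15 and fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch on the test class or method:
    # @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")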
2025-05-07T20:32:35.7491885Z op = silu_mul_quant 2025-05-07T20:32:35.7491973Z if compiled: 2025-05-07T20:32:35.7492070Z op = torch.compile(op) 2025-05-07T20:32:35.7492177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7492245Z 2025-05-07T20:32:35.7492333Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7492453Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7492522Z 2025-05-07T20:32:35.7492656Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7492760Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7492857Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7492977Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7493124Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7493191Z 2025-05-07T20:32:35.7493287Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7493387Z 2025-05-07T20:32:35.7493487Z moe/activation_test.py:126: 2025-05-07T20:32:35.7493619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7493724Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7493855Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7494410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7494580Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7494985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7495211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7495574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7495830Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7496231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7496479Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7496845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7497010Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7497357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7497431Z fn() 2025-05-07T20:32:35.7497830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7497909Z self.fn.run( 2025-05-07T20:32:35.7498312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7498406Z kernel = self.compile( 2025-05-07T20:32:35.7498781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7498955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7499080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7499085Z 2025-05-07T20:32:35.7499293Z self = 2025-05-07T20:32:35.7500167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7500675Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09694c700>} 2025-05-07T20:32:35.7501411Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7501598Z context = 2025-05-07T20:32:35.7501603Z 2025-05-07T20:32:35.7501768Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7502033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7502145Z module_map=module_map) 2025-05-07T20:32:35.7502307Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7502409Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7502488Z E ^ 2025-05-07T20:32:35.7502839Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7502916Z 2025-05-07T20:32:35.7503327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7503334Z 2025-05-07T20:32:35.7503433Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7503647Z self=, 2025-05-07T20:32:35.7503722Z T=4096, 2025-05-07T20:32:35.7503794Z D=5120, 2025-05-07T20:32:35.7503914Z scale_ub=None, 2025-05-07T20:32:35.7504001Z contiguous=True, 2025-05-07T20:32:35.7504078Z compiled=True, 2025-05-07T20:32:35.7504148Z ) 2025-05-07T20:32:35.7504405Z self = 2025-05-07T20:32:35.7504574Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7504579Z 2025-05-07T20:32:35.7504652Z @given( 2025-05-07T20:32:35.7504772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7504869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7504988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7505101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7505214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7505290Z ) 2025-05-07T20:32:35.7505533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7505623Z def test_silu_mul_quant( 2025-05-07T20:32:35.7505709Z self, 2025-05-07T20:32:35.7505783Z T: int, 2025-05-07T20:32:35.7505860Z D: int, 2025-05-07T20:32:35.7505964Z scale_ub: Optional[float], 2025-05-07T20:32:35.7506050Z contiguous: bool, 2025-05-07T20:32:35.7506137Z compiled: bool, 2025-05-07T20:32:35.7506211Z ) -> None: 2025-05-07T20:32:35.7506304Z torch.manual_seed(2025) 2025-05-07T20:32:35.7506379Z 2025-05-07T20:32:35.7506593Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7506664Z 2025-05-07T20:32:35.7506753Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7506877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7506962Z x = x_sign * x_clamp 2025-05-07T20:32:35.7507041Z x0 = x[:, :D] 2025-05-07T20:32:35.7507115Z x1 = x[:, D:] 2025-05-07T20:32:35.7507187Z 2025-05-07T20:32:35.7507279Z if contiguous: 2025-05-07T20:32:35.7507366Z x0 = x0.contiguous() 2025-05-07T20:32:35.7507459Z x1 = x1.contiguous() 2025-05-07T20:32:35.7507527Z 2025-05-07T20:32:35.7507619Z if scale_ub is not None: 2025-05-07T20:32:35.7507726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7507857Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7507929Z ) 2025-05-07T20:32:35.7508006Z else: 2025-05-07T20:32:35.7508107Z scale_ub_tensor 
= None 2025-05-07T20:32:35.7508177Z 2025-05-07T20:32:35.7508311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7508399Z op = silu_mul_quant 2025-05-07T20:32:35.7508482Z if compiled: 2025-05-07T20:32:35.7508583Z op = torch.compile(op) 2025-05-07T20:32:35.7508688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7508757Z 2025-05-07T20:32:35.7508847Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7508967Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7509041Z 2025-05-07T20:32:35.7509178Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7509280Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7509382Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7509503Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7509641Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7509766Z 2025-05-07T20:32:35.7509866Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7509870Z 2025-05-07T20:32:35.7509969Z moe/activation_test.py:126: 2025-05-07T20:32:35.7510093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7510196Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7510334Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7510883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7511022Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7511415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7511638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7512019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7512267Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7512664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7512916Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7513289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7513463Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7513798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7513869Z fn() 2025-05-07T20:32:35.7514266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7514393Z self.fn.run( 2025-05-07T20:32:35.7514731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7514828Z kernel = self.compile( 2025-05-07T20:32:35.7515198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7515373Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7515499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7515507Z 2025-05-07T20:32:35.7515713Z self = 2025-05-07T20:32:35.7516480Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7516976Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096894280>} 2025-05-07T20:32:35.7517709Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7517894Z context = 2025-05-07T20:32:35.7517902Z 2025-05-07T20:32:35.7518062Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7518327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7518430Z module_map=module_map) 2025-05-07T20:32:35.7518596Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7518695Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7518813Z E ^ 2025-05-07T20:32:35.7519167Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7519173Z 2025-05-07T20:32:35.7519580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7519585Z 2025-05-07T20:32:35.7519689Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7519905Z self=, 2025-05-07T20:32:35.7520020Z T=16384, 2025-05-07T20:32:35.7520096Z D=5120, 2025-05-07T20:32:35.7520177Z scale_ub=None, 2025-05-07T20:32:35.7520303Z contiguous=True, 2025-05-07T20:32:35.7520386Z compiled=True, 2025-05-07T20:32:35.7520455Z ) 2025-05-07T20:32:35.7520667Z self = 2025-05-07T20:32:35.7520848Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.7520858Z 2025-05-07T20:32:35.7520931Z @given( 2025-05-07T20:32:35.7521058Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7521172Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7521304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7521431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7521545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7521615Z ) 2025-05-07T20:32:35.7521866Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7521961Z def test_silu_mul_quant( 2025-05-07T20:32:35.7522036Z self, 2025-05-07T20:32:35.7522112Z T: int, 2025-05-07T20:32:35.7522183Z D: int, 2025-05-07T20:32:35.7522280Z scale_ub: Optional[float], 2025-05-07T20:32:35.7522370Z contiguous: bool, 2025-05-07T20:32:35.7522454Z compiled: bool, 2025-05-07T20:32:35.7522538Z ) -> None: 2025-05-07T20:32:35.7522673Z torch.manual_seed(2025) 2025-05-07T20:32:35.7522745Z 2025-05-07T20:32:35.7522916Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7522983Z 2025-05-07T20:32:35.7523072Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7523203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7523287Z x = x_sign * x_clamp 2025-05-07T20:32:35.7523361Z x0 = x[:, :D] 2025-05-07T20:32:35.7523445Z x1 = x[:, D:] 2025-05-07T20:32:35.7523518Z 2025-05-07T20:32:35.7523599Z if contiguous: 2025-05-07T20:32:35.7523696Z x0 = x0.contiguous() 2025-05-07T20:32:35.7523785Z x1 = x1.contiguous() 2025-05-07T20:32:35.7523857Z 2025-05-07T20:32:35.7523950Z if scale_ub is not None: 2025-05-07T20:32:35.7524052Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7524195Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:35.7524271Z ) 2025-05-07T20:32:35.7524348Z else: 2025-05-07T20:32:35.7524446Z scale_ub_tensor = None 2025-05-07T20:32:35.7524513Z 2025-05-07T20:32:35.7524646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7524734Z op = silu_mul_quant 2025-05-07T20:32:35.7524819Z if compiled: 2025-05-07T20:32:35.7524922Z op = torch.compile(op) 2025-05-07T20:32:35.7525027Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7525100Z 2025-05-07T20:32:35.7525200Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7525317Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7525395Z 2025-05-07T20:32:35.7525536Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7525634Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7525735Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7525907Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7526044Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7526122Z 2025-05-07T20:32:35.7526219Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.7526224Z 2025-05-07T20:32:35.7526321Z moe/activation_test.py:126: 2025-05-07T20:32:35.7526452Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7526554Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7526684Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7527314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7527415Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7527779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7528003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7528370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7528624Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7529012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7529263Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7529641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7529809Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7530154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7530232Z fn() 2025-05-07T20:32:35.7530689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7530772Z self.fn.run( 2025-05-07T20:32:35.7531103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7531202Z kernel = self.compile( 2025-05-07T20:32:35.7531575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7531745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7531879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:35.7531883Z 2025-05-07T20:32:35.7532082Z self = 2025-05-07T20:32:35.7532849Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7533353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096894a60>} 2025-05-07T20:32:35.7534086Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7534280Z context = 2025-05-07T20:32:35.7534285Z 2025-05-07T20:32:35.7534449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7534716Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7534820Z module_map=module_map) 2025-05-07T20:32:35.7535026Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7535131Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7535205Z E ^ 2025-05-07T20:32:35.7535558Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7535562Z 2025-05-07T20:32:35.7535966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7535971Z 2025-05-07T20:32:35.7536070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7536331Z self=, 2025-05-07T20:32:35.7536441Z T=1, 2025-05-07T20:32:35.7536514Z D=5120, 2025-05-07T20:32:35.7536598Z scale_ub=1200.0, 2025-05-07T20:32:35.7536682Z contiguous=True, 2025-05-07T20:32:35.7536764Z compiled=True, 2025-05-07T20:32:35.7536835Z ) 2025-05-07T20:32:35.7537047Z self = 2025-05-07T20:32:35.7537215Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.7537219Z 2025-05-07T20:32:35.7537293Z @given( 2025-05-07T20:32:35.7537407Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7537507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7537621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7537735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7537847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7537920Z ) 2025-05-07T20:32:35.7538168Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7538261Z def test_silu_mul_quant( 2025-05-07T20:32:35.7538336Z self, 2025-05-07T20:32:35.7538410Z T: int, 2025-05-07T20:32:35.7538481Z D: int, 2025-05-07T20:32:35.7538577Z scale_ub: Optional[float], 2025-05-07T20:32:35.7538714Z contiguous: bool, 2025-05-07T20:32:35.7538799Z compiled: bool, 2025-05-07T20:32:35.7538876Z ) -> None: 2025-05-07T20:32:35.7538973Z torch.manual_seed(2025) 2025-05-07T20:32:35.7539046Z 2025-05-07T20:32:35.7539212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7539290Z 2025-05-07T20:32:35.7539383Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7539510Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7539598Z x = x_sign * x_clamp 2025-05-07T20:32:35.7539678Z x0 = x[:, :D] 2025-05-07T20:32:35.7539912Z x1 = x[:, D:] 2025-05-07T20:32:35.7539983Z 2025-05-07T20:32:35.7540068Z if contiguous: 2025-05-07T20:32:35.7540166Z x0 = x0.contiguous() 2025-05-07T20:32:35.7540253Z x1 = x1.contiguous() 2025-05-07T20:32:35.7540320Z 2025-05-07T20:32:35.7540414Z if scale_ub is not None: 2025-05-07T20:32:35.7540520Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:35.7540659Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7540738Z ) 2025-05-07T20:32:35.7540811Z else: 2025-05-07T20:32:35.7540903Z scale_ub_tensor = None 2025-05-07T20:32:35.7540977Z 2025-05-07T20:32:35.7541103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7541194Z op = silu_mul_quant 2025-05-07T20:32:35.7541278Z if compiled: 2025-05-07T20:32:35.7541375Z op = torch.compile(op) 2025-05-07T20:32:35.7541487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7541557Z 2025-05-07T20:32:35.7541649Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7541654Z 2025-05-07T20:32:35.7541757Z moe/activation_test.py:117: 2025-05-07T20:32:35.7541885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7541985Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7542141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7542501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7542596Z return fn(*args, **kwargs) 2025-05-07T20:32:35.7543084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7543180Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7543535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7543835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7544180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7544274Z kernel = self.compile( 2025-05-07T20:32:35.7544658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7544839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7544962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7544966Z 2025-05-07T20:32:35.7545168Z self = 2025-05-07T20:32:35.7545933Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7546440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09688f1c0>} 2025-05-07T20:32:35.7547214Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7547405Z context = 2025-05-07T20:32:35.7547410Z 2025-05-07T20:32:35.7547572Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7547828Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7547932Z module_map=module_map) 2025-05-07T20:32:35.7548296Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7548396Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7548469Z E ^ 2025-05-07T20:32:35.7548823Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7548828Z 2025-05-07T20:32:35.7549240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7549250Z 2025-05-07T20:32:35.7549356Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7549574Z self=, 2025-05-07T20:32:35.7549645Z T=1, 2025-05-07T20:32:35.7549722Z D=5120, 2025-05-07T20:32:35.7549801Z scale_ub=None, 2025-05-07T20:32:35.7549882Z contiguous=False, 2025-05-07T20:32:35.7549967Z compiled=True, 2025-05-07T20:32:35.7550039Z ) 2025-05-07T20:32:35.7550254Z self = 2025-05-07T20:32:35.7550417Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7550421Z 2025-05-07T20:32:35.7550496Z @given( 2025-05-07T20:32:35.7550616Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7550714Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7550829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7551001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7551114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7551181Z ) 2025-05-07T20:32:35.7551427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7551516Z def test_silu_mul_quant( 2025-05-07T20:32:35.7551590Z self, 2025-05-07T20:32:35.7551662Z T: int, 2025-05-07T20:32:35.7551736Z D: int, 2025-05-07T20:32:35.7551839Z scale_ub: Optional[float], 2025-05-07T20:32:35.7551926Z contiguous: bool, 2025-05-07T20:32:35.7552051Z compiled: bool, 2025-05-07T20:32:35.7552129Z ) -> None: 2025-05-07T20:32:35.7552258Z torch.manual_seed(2025) 2025-05-07T20:32:35.7552324Z 2025-05-07T20:32:35.7552496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7552564Z 2025-05-07T20:32:35.7552652Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7552778Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7552868Z x = x_sign * x_clamp 2025-05-07T20:32:35.7552951Z x0 = x[:, :D] 2025-05-07T20:32:35.7553026Z x1 = x[:, D:] 2025-05-07T20:32:35.7553096Z 2025-05-07T20:32:35.7553179Z if contiguous: 2025-05-07T20:32:35.7553269Z x0 = x0.contiguous() 2025-05-07T20:32:35.7553353Z x1 = x1.contiguous() 2025-05-07T20:32:35.7553426Z 2025-05-07T20:32:35.7553510Z if scale_ub is not None: 2025-05-07T20:32:35.7553611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7553751Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7553825Z ) 2025-05-07T20:32:35.7553900Z else: 2025-05-07T20:32:35.7554004Z scale_ub_tensor = None 2025-05-07T20:32:35.7554074Z 2025-05-07T20:32:35.7554201Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7554291Z op = silu_mul_quant 2025-05-07T20:32:35.7554376Z if compiled: 2025-05-07T20:32:35.7554521Z op = torch.compile(op) 2025-05-07T20:32:35.7554626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7554697Z 2025-05-07T20:32:35.7554789Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7554908Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7554980Z 2025-05-07T20:32:35.7555119Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7555217Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7555316Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7555446Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7555586Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7555659Z 2025-05-07T20:32:35.7555758Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7555762Z 2025-05-07T20:32:35.7555856Z moe/activation_test.py:126: 2025-05-07T20:32:35.7555989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7556089Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7556220Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7556769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7556865Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7557220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7557443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7557804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7558057Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7558498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7558748Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7559116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7559277Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7559621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7559736Z fn() 2025-05-07T20:32:35.7560194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7560277Z self.fn.run( 2025-05-07T20:32:35.7560616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7560716Z kernel = self.compile( 2025-05-07T20:32:35.7561144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7561327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7561454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7561458Z 2025-05-07T20:32:35.7561661Z self = 2025-05-07T20:32:35.7562426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7562922Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd09688e680>} 2025-05-07T20:32:35.7563697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7563886Z context = 2025-05-07T20:32:35.7563891Z 2025-05-07T20:32:35.7564053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7564313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7564421Z module_map=module_map) 2025-05-07T20:32:35.7564579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7564683Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7564756Z E ^ 2025-05-07T20:32:35.7565103Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7565112Z 2025-05-07T20:32:35.7565529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7565534Z 2025-05-07T20:32:35.7565634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7565854Z self=, 2025-05-07T20:32:35.7565928Z T=1, 2025-05-07T20:32:35.7566000Z D=5120, 2025-05-07T20:32:35.7566080Z scale_ub=None, 2025-05-07T20:32:35.7566162Z contiguous=True, 2025-05-07T20:32:35.7566241Z compiled=False, 2025-05-07T20:32:35.7566316Z ) 2025-05-07T20:32:35.7566528Z self = 2025-05-07T20:32:35.7566698Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.7566703Z 2025-05-07T20:32:35.7566778Z @given( 2025-05-07T20:32:35.7566894Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7566994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7567156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7567271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7567389Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7567460Z ) 2025-05-07T20:32:35.7567701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7567792Z def test_silu_mul_quant( 2025-05-07T20:32:35.7567864Z self, 2025-05-07T20:32:35.7567944Z T: int, 2025-05-07T20:32:35.7568015Z D: int, 2025-05-07T20:32:35.7568157Z scale_ub: Optional[float], 2025-05-07T20:32:35.7568245Z contiguous: bool, 2025-05-07T20:32:35.7568365Z compiled: bool, 2025-05-07T20:32:35.7568442Z ) -> None: 2025-05-07T20:32:35.7568540Z torch.manual_seed(2025) 2025-05-07T20:32:35.7568608Z 2025-05-07T20:32:35.7568774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7568853Z 2025-05-07T20:32:35.7568945Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7569067Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7569156Z x = x_sign * x_clamp 2025-05-07T20:32:35.7569233Z x0 = x[:, :D] 2025-05-07T20:32:35.7569312Z x1 = x[:, D:] 2025-05-07T20:32:35.7569383Z 2025-05-07T20:32:35.7569465Z if contiguous: 2025-05-07T20:32:35.7569560Z x0 = x0.contiguous() 2025-05-07T20:32:35.7569645Z x1 = x1.contiguous() 2025-05-07T20:32:35.7569717Z 2025-05-07T20:32:35.7569813Z if scale_ub is not None: 2025-05-07T20:32:35.7569915Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7570050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7570125Z ) 2025-05-07T20:32:35.7570199Z else: 2025-05-07T20:32:35.7570294Z scale_ub_tensor = None 2025-05-07T20:32:35.7570364Z 2025-05-07T20:32:35.7570490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7570628Z op = silu_mul_quant 2025-05-07T20:32:35.7570723Z if compiled: 2025-05-07T20:32:35.7570820Z 
op = torch.compile(op) 2025-05-07T20:32:35.7570927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7570994Z 2025-05-07T20:32:35.7571085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7571090Z 2025-05-07T20:32:35.7571188Z moe/activation_test.py:117: 2025-05-07T20:32:35.7571313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7571414Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7571513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7572006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7572104Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7572458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7572676Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7573010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7573100Z kernel = self.compile( 2025-05-07T20:32:35.7573476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7573651Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7573776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7573782Z 2025-05-07T20:32:35.7573987Z self = 2025-05-07T20:32:35.7574749Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7575300Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbd900>} 2025-05-07T20:32:35.7576041Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7576271Z context = 2025-05-07T20:32:35.7576275Z 2025-05-07T20:32:35.7576478Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7576741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7576846Z module_map=module_map) 2025-05-07T20:32:35.7577005Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7577107Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7577184Z E ^ 2025-05-07T20:32:35.7577533Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7577537Z 2025-05-07T20:32:35.7577948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7577953Z 2025-05-07T20:32:35.7578060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7578280Z self=, 2025-05-07T20:32:35.7578355Z T=128, 2025-05-07T20:32:35.7578431Z D=5120, 2025-05-07T20:32:35.7578508Z scale_ub=None, 2025-05-07T20:32:35.7578593Z contiguous=False, 2025-05-07T20:32:35.7578669Z compiled=True, 2025-05-07T20:32:35.7578737Z ) 2025-05-07T20:32:35.7578953Z self = 2025-05-07T20:32:35.7579165Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7579170Z 2025-05-07T20:32:35.7579243Z @given( 2025-05-07T20:32:35.7579364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7579459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7579580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7579692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7579916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7579997Z ) 2025-05-07T20:32:35.7580238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7580331Z def test_silu_mul_quant( 2025-05-07T20:32:35.7580413Z self, 2025-05-07T20:32:35.7580483Z T: int, 2025-05-07T20:32:35.7580555Z D: int, 2025-05-07T20:32:35.7580655Z scale_ub: Optional[float], 2025-05-07T20:32:35.7580743Z contiguous: bool, 2025-05-07T20:32:35.7580829Z compiled: bool, 2025-05-07T20:32:35.7580910Z ) -> None: 2025-05-07T20:32:35.7581002Z torch.manual_seed(2025) 2025-05-07T20:32:35.7581075Z 2025-05-07T20:32:35.7581242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7581311Z 2025-05-07T20:32:35.7581402Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7581526Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7581614Z x = x_sign * x_clamp 2025-05-07T20:32:35.7581701Z x0 = x[:, :D] 2025-05-07T20:32:35.7581774Z x1 = x[:, D:] 2025-05-07T20:32:35.7581839Z 2025-05-07T20:32:35.7581922Z if contiguous: 2025-05-07T20:32:35.7582013Z x0 = x0.contiguous() 2025-05-07T20:32:35.7582101Z x1 = x1.contiguous() 2025-05-07T20:32:35.7582175Z 2025-05-07T20:32:35.7582263Z if scale_ub is not None: 2025-05-07T20:32:35.7582370Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7582631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7582702Z ) 2025-05-07T20:32:35.7582780Z else: 2025-05-07T20:32:35.7582874Z scale_ub_tensor = None 2025-05-07T20:32:35.7582942Z 2025-05-07T20:32:35.7583075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7583164Z op = silu_mul_quant 2025-05-07T20:32:35.7583248Z if compiled: 2025-05-07T20:32:35.7583350Z op = torch.compile(op) 2025-05-07T20:32:35.7583497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7583563Z 2025-05-07T20:32:35.7583655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7583696Z 2025-05-07T20:32:35.7583794Z moe/activation_test.py:117: 2025-05-07T20:32:35.7583924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7584026Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7587545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7587943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7588038Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7588529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7588626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7588984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7589209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7589550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7589649Z kernel = self.compile( 2025-05-07T20:32:35.7590298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7590588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7590721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7590726Z 2025-05-07T20:32:35.7590934Z self = 2025-05-07T20:32:35.7591698Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7592200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbfeb0>} 2025-05-07T20:32:35.7592942Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7593138Z context = 2025-05-07T20:32:35.7593142Z 2025-05-07T20:32:35.7593308Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7593575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7593681Z module_map=module_map) 2025-05-07T20:32:35.7593839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7593944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7594016Z E ^ 2025-05-07T20:32:35.7594367Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7594376Z 2025-05-07T20:32:35.7594784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7594851Z 2025-05-07T20:32:35.7594960Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7595184Z self=, 2025-05-07T20:32:35.7595255Z T=128, 2025-05-07T20:32:35.7595326Z D=7168, 2025-05-07T20:32:35.7595409Z scale_ub=1200.0, 2025-05-07T20:32:35.7595492Z contiguous=False, 2025-05-07T20:32:35.7595574Z compiled=False, 2025-05-07T20:32:35.7595648Z ) 2025-05-07T20:32:35.7595861Z self = 2025-05-07T20:32:35.7596127Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.7596132Z 2025-05-07T20:32:35.7596267Z @given( 2025-05-07T20:32:35.7596389Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7596489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7596604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7596719Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7596841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7596914Z ) 2025-05-07T20:32:35.7597162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7597253Z def test_silu_mul_quant( 2025-05-07T20:32:35.7597328Z self, 2025-05-07T20:32:35.7597409Z T: int, 2025-05-07T20:32:35.7597484Z D: int, 2025-05-07T20:32:35.7597582Z scale_ub: Optional[float], 2025-05-07T20:32:35.7597672Z contiguous: bool, 2025-05-07T20:32:35.7597757Z compiled: bool, 2025-05-07T20:32:35.7597834Z ) -> None: 2025-05-07T20:32:35.7597933Z torch.manual_seed(2025) 2025-05-07T20:32:35.7598006Z 2025-05-07T20:32:35.7598174Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7598253Z 2025-05-07T20:32:35.7598346Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7598468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7598619Z x = x_sign * x_clamp 2025-05-07T20:32:35.7598703Z x0 = x[:, :D] 2025-05-07T20:32:35.7598786Z x1 = x[:, D:] 2025-05-07T20:32:35.7598858Z 2025-05-07T20:32:35.7598940Z if contiguous: 2025-05-07T20:32:35.7599037Z x0 = x0.contiguous() 2025-05-07T20:32:35.7599127Z x1 = x1.contiguous() 2025-05-07T20:32:35.7599199Z 2025-05-07T20:32:35.7599296Z if scale_ub is not None: 2025-05-07T20:32:35.7599406Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7599545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7599628Z ) 2025-05-07T20:32:35.7599707Z else: 2025-05-07T20:32:35.7599802Z scale_ub_tensor = None 2025-05-07T20:32:35.7599880Z 2025-05-07T20:32:35.7600012Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7600102Z op = silu_mul_quant 2025-05-07T20:32:35.7600189Z if compiled: 2025-05-07T20:32:35.7600297Z op = torch.compile(op) 2025-05-07T20:32:35.7600408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7600478Z 2025-05-07T20:32:35.7600570Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7600574Z 2025-05-07T20:32:35.7600676Z moe/activation_test.py:117: 2025-05-07T20:32:35.7600803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7600901Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7601002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7601509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7601611Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7601965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7602186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7602580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7602674Z kernel = self.compile( 2025-05-07T20:32:35.7603051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7603226Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7603353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7603400Z 2025-05-07T20:32:35.7603610Z self = 2025-05-07T20:32:35.7604414Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7604929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd096dbd7e0>} 2025-05-07T20:32:35.7605669Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7605860Z context = 2025-05-07T20:32:35.7605865Z 2025-05-07T20:32:35.7606032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7606297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7606410Z module_map=module_map) 2025-05-07T20:32:35.7606573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7606671Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7606756Z E ^ 2025-05-07T20:32:35.7607149Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7607155Z 2025-05-07T20:32:35.7607573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7607583Z 2025-05-07T20:32:35.7607693Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7607911Z self=, 2025-05-07T20:32:35.7607996Z T=128, 2025-05-07T20:32:35.7608072Z D=5120, 2025-05-07T20:32:35.7608153Z scale_ub=None, 2025-05-07T20:32:35.7608241Z contiguous=False, 2025-05-07T20:32:35.7608327Z compiled=False, 2025-05-07T20:32:35.7608399Z ) 2025-05-07T20:32:35.7608615Z self = 2025-05-07T20:32:35.7608785Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7608792Z 2025-05-07T20:32:35.7608873Z @given( 2025-05-07T20:32:35.7608994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7609093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7609212Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7609335Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7609449Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7609528Z ) 2025-05-07T20:32:35.7609772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7609867Z def test_silu_mul_quant( 2025-05-07T20:32:35.7609947Z self, 2025-05-07T20:32:35.7610025Z T: int, 2025-05-07T20:32:35.7610101Z D: int, 2025-05-07T20:32:35.7610204Z scale_ub: Optional[float], 2025-05-07T20:32:35.7610292Z contiguous: bool, 2025-05-07T20:32:35.7610385Z compiled: bool, 2025-05-07T20:32:35.7610466Z ) -> None: 2025-05-07T20:32:35.7610609Z torch.manual_seed(2025) 2025-05-07T20:32:35.7610684Z 2025-05-07T20:32:35.7610852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7610927Z 2025-05-07T20:32:35.7611026Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7611148Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7611235Z x = x_sign * x_clamp 2025-05-07T20:32:35.7611320Z x0 = x[:, :D] 2025-05-07T20:32:35.7611406Z x1 = x[:, D:] 2025-05-07T20:32:35.7611476Z 2025-05-07T20:32:35.7611606Z if contiguous: 2025-05-07T20:32:35.7611703Z x0 = x0.contiguous() 2025-05-07T20:32:35.7611789Z x1 = x1.contiguous() 2025-05-07T20:32:35.7611898Z 2025-05-07T20:32:35.7611995Z if scale_ub is not None: 2025-05-07T20:32:35.7612100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7612234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7612315Z ) 2025-05-07T20:32:35.7612393Z else: 2025-05-07T20:32:35.7612490Z scale_ub_tensor = None 2025-05-07T20:32:35.7612560Z 2025-05-07T20:32:35.7612687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7612777Z op = silu_mul_quant 2025-05-07T20:32:35.7612861Z if compiled: 2025-05-07T20:32:35.7612960Z op = torch.compile(op) 2025-05-07T20:32:35.7613073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7613146Z 2025-05-07T20:32:35.7613236Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7613243Z 2025-05-07T20:32:35.7613341Z moe/activation_test.py:117: 2025-05-07T20:32:35.7613471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7613576Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7613679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7614216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7614322Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7614677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7614895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7615237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7615331Z kernel = self.compile( 2025-05-07T20:32:35.7615715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7615892Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7616017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7616022Z 2025-05-07T20:32:35.7616231Z self = 2025-05-07T20:32:35.7617000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7617497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd09696beb0>} 2025-05-07T20:32:35.7618236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7618427Z context = 2025-05-07T20:32:35.7618437Z 2025-05-07T20:32:35.7618603Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7618862Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7619019Z module_map=module_map) 2025-05-07T20:32:35.7619178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7619275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7619353Z E ^ 2025-05-07T20:32:35.7619704Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:35.7620492Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:35.7633122Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.7646198Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[Hypothesis re-prints the full test source and an identical traceback for each of these examples; the repeats are elided here. The last traceback ends with:]
2025-05-07T20:32:35.7658199Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7658299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7658378Z E ^ 2025-05-07T20:32:35.7658728Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7658733Z 2025-05-07T20:32:35.7659150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7659201Z 2025-05-07T20:32:35.7659305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7659529Z self=, 2025-05-07T20:32:35.7659605Z T=1, 2025-05-07T20:32:35.7659680Z D=7168, 2025-05-07T20:32:35.7659898Z scale_ub=None, 2025-05-07T20:32:35.7659985Z contiguous=False, 2025-05-07T20:32:35.7660066Z compiled=True, 2025-05-07T20:32:35.7660139Z ) 2025-05-07T20:32:35.7660351Z self = 2025-05-07T20:32:35.7660560Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7660606Z 2025-05-07T20:32:35.7660683Z @given( 2025-05-07T20:32:35.7660803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7660905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7661019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7661142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7661259Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7661335Z ) 2025-05-07T20:32:35.7661581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7661677Z def test_silu_mul_quant( 2025-05-07T20:32:35.7661751Z self, 2025-05-07T20:32:35.7661825Z T: int, 2025-05-07T20:32:35.7661903Z D: int, 2025-05-07T20:32:35.7662002Z scale_ub: Optional[float], 2025-05-07T20:32:35.7662100Z contiguous: bool, 2025-05-07T20:32:35.7662186Z compiled: bool, 2025-05-07T20:32:35.7662264Z ) -> None: 2025-05-07T20:32:35.7662364Z torch.manual_seed(2025) 2025-05-07T20:32:35.7662437Z 2025-05-07T20:32:35.7662606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7662684Z 2025-05-07T20:32:35.7662777Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7662947Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7663039Z x = x_sign * x_clamp 2025-05-07T20:32:35.7663119Z x0 = x[:, :D] 2025-05-07T20:32:35.7663197Z x1 = x[:, D:] 2025-05-07T20:32:35.7663277Z 2025-05-07T20:32:35.7663361Z if contiguous: 2025-05-07T20:32:35.7663454Z x0 = x0.contiguous() 2025-05-07T20:32:35.7663542Z x1 = x1.contiguous() 2025-05-07T20:32:35.7663616Z 2025-05-07T20:32:35.7663714Z if scale_ub is not None: 2025-05-07T20:32:35.7663823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7663957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7664039Z ) 2025-05-07T20:32:35.7664117Z else: 2025-05-07T20:32:35.7664211Z scale_ub_tensor = None 2025-05-07T20:32:35.7664288Z 2025-05-07T20:32:35.7664417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7664504Z op = silu_mul_quant 2025-05-07T20:32:35.7664600Z if compiled: 2025-05-07T20:32:35.7664700Z op = torch.compile(op) 2025-05-07T20:32:35.7664814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7664886Z 2025-05-07T20:32:35.7664977Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.7665102Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.7665171Z 2025-05-07T20:32:35.7665306Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7665411Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.7665514Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.7665638Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.7665783Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7665854Z 2025-05-07T20:32:35.7665954Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.7665964Z 2025-05-07T20:32:35.7666064Z moe/activation_test.py:126: 2025-05-07T20:32:35.7666243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7666353Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.7666485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.7667040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.7667145Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.7667498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7667801Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7668170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.7668423Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7668832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.7669080Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.7669455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.7669621Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.7669962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.7670043Z fn() 2025-05-07T20:32:35.7670447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.7670528Z self.fn.run( 2025-05-07T20:32:35.7670865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7671002Z kernel = self.compile( 2025-05-07T20:32:35.7671386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7671564Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7671689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7671693Z 2025-05-07T20:32:35.7671902Z self = 2025-05-07T20:32:35.7672680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7673187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0965a2ef0>} 2025-05-07T20:32:35.7673930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7674121Z context = 2025-05-07T20:32:35.7674125Z 2025-05-07T20:32:35.7674291Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7674551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7674662Z module_map=module_map) 2025-05-07T20:32:35.7674826Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7674925Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.7675003Z E ^ 2025-05-07T20:32:35.7675353Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7675400Z 2025-05-07T20:32:35.7675816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7675823Z
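Unlike the other examples, this one fails a step later: fn() returned, and the CompilationError is raised from ref_fn instead, because triton_quantize_fp8_row JIT-compiles its own Triton kernel (_kernel_quantize_fp8_row) targeting the same unsupported fp8e4nv type. A hedged sketch of a Triton-free, pure-PyTorch stand-in for that reference quantization, assuming the usual row-wise absmax recipe; the function name is made up, and FBGEMM's actual kernel may differ in details such as clamping:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Row-wise absmax quantization: pick a per-row scale so the row's
        # largest magnitude maps to FP8_MAX, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_scaled = torch.clamp(y.float() / scale, min=-FP8_MAX, max=FP8_MAX)
        # scale is the per-row dequantization factor, matching the test's
        # y = y_fp8.to(torch.float32) * y_scale[:, None]
        return y_scaled.to(torch.float8_e4m3fn), scale.squeeze(-1)

On hardware where fp8e4nv compiles, the Triton kernel and a reference like this should agree up to fp8 rounding; on the A10G it would at least keep the reference path off the Triton JIT.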
2025-05-07T20:32:35.7675925Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7688910Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.7701834Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7718374Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7731359Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:35.7743763Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.7756176Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
[These examples again re-print the same test source and fail at y_fp8, y_scale = fn() with the identical CompilationError in _fbgemm_silu_mul_quant; the repeats are elided. The last traceback ends with:]
2025-05-07T20:32:35.7767968Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7768071Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7768146Z E ^ 2025-05-07T20:32:35.7768498Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7768503Z 2025-05-07T20:32:35.7768915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7768920Z 2025-05-07T20:32:35.7769019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7769239Z self=, 2025-05-07T20:32:35.7769312Z T=4096, 2025-05-07T20:32:35.7769389Z D=5120, 2025-05-07T20:32:35.7769470Z scale_ub=None, 2025-05-07T20:32:35.7769553Z contiguous=False, 2025-05-07T20:32:35.7769635Z compiled=True, 2025-05-07T20:32:35.7769706Z ) 2025-05-07T20:32:35.7769918Z self = 2025-05-07T20:32:35.7770090Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7770098Z 2025-05-07T20:32:35.7770213Z @given( 2025-05-07T20:32:35.7770330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7770433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7770545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7770659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7770772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7770843Z ) 2025-05-07T20:32:35.7771089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7771182Z def test_silu_mul_quant( 2025-05-07T20:32:35.7771252Z self, 2025-05-07T20:32:35.7771330Z T: int, 2025-05-07T20:32:35.7771409Z D: int, 2025-05-07T20:32:35.7771512Z scale_ub: Optional[float], 2025-05-07T20:32:35.7771598Z contiguous: bool, 2025-05-07T20:32:35.7771681Z compiled: bool, 2025-05-07T20:32:35.7771764Z ) -> None: 2025-05-07T20:32:35.7771858Z torch.manual_seed(2025) 2025-05-07T20:32:35.7771933Z 2025-05-07T20:32:35.7772101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7772169Z 2025-05-07T20:32:35.7772260Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7772388Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7772475Z x = x_sign * x_clamp 2025-05-07T20:32:35.7772553Z x0 = x[:, :D] 2025-05-07T20:32:35.7772633Z x1 = x[:, D:] 2025-05-07T20:32:35.7772699Z 2025-05-07T20:32:35.7772784Z if contiguous: 2025-05-07T20:32:35.7772875Z x0 = x0.contiguous() 2025-05-07T20:32:35.7772962Z x1 = x1.contiguous() 2025-05-07T20:32:35.7773032Z 2025-05-07T20:32:35.7773119Z if scale_ub is not None: 2025-05-07T20:32:35.7773221Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7773353Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7773425Z ) 2025-05-07T20:32:35.7773548Z else: 2025-05-07T20:32:35.7773643Z scale_ub_tensor = None 2025-05-07T20:32:35.7773713Z 2025-05-07T20:32:35.7773841Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7773928Z op = silu_mul_quant 2025-05-07T20:32:35.7774010Z if compiled: 2025-05-07T20:32:35.7774104Z op = torch.compile(op) 2025-05-07T20:32:35.7774212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7774281Z 2025-05-07T20:32:35.7774374Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7774421Z 2025-05-07T20:32:35.7774517Z moe/activation_test.py:117: 2025-05-07T20:32:35.7774681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7774785Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7774880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7775239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7775338Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7775830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7775924Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7776274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7776491Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7776835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7776927Z kernel = self.compile( 2025-05-07T20:32:35.7777310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7777490Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7777680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7777685Z 2025-05-07T20:32:35.7777892Z self = 2025-05-07T20:32:35.7778655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7779150Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f0280>} 2025-05-07T20:32:35.7780007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7780194Z context = 2025-05-07T20:32:35.7780203Z 2025-05-07T20:32:35.7780368Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7780629Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7780738Z module_map=module_map) 2025-05-07T20:32:35.7780896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7781017Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7781095Z E ^ 2025-05-07T20:32:35.7781464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7781471Z 2025-05-07T20:32:35.7781883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7781890Z 2025-05-07T20:32:35.7781992Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7782210Z self=, 2025-05-07T20:32:35.7782331Z T=4096, 2025-05-07T20:32:35.7782402Z D=5120, 2025-05-07T20:32:35.7782482Z scale_ub=1200.0, 2025-05-07T20:32:35.7782565Z contiguous=False, 2025-05-07T20:32:35.7782645Z compiled=False, 2025-05-07T20:32:35.7782713Z ) 2025-05-07T20:32:35.7782926Z self = 2025-05-07T20:32:35.7783096Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.7783100Z 2025-05-07T20:32:35.7783216Z @given( 2025-05-07T20:32:35.7783331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7783467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7783585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7783702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7783812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7783888Z ) 2025-05-07T20:32:35.7784130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7784220Z def test_silu_mul_quant( 2025-05-07T20:32:35.7784300Z self, 2025-05-07T20:32:35.7784370Z T: int, 2025-05-07T20:32:35.7784440Z D: int, 2025-05-07T20:32:35.7784539Z scale_ub: Optional[float], 2025-05-07T20:32:35.7784627Z contiguous: bool, 2025-05-07T20:32:35.7784713Z compiled: bool, 2025-05-07T20:32:35.7784787Z ) -> None: 2025-05-07T20:32:35.7784877Z torch.manual_seed(2025) 2025-05-07T20:32:35.7784948Z 2025-05-07T20:32:35.7785114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7785189Z 2025-05-07T20:32:35.7785280Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7785402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7785487Z x = x_sign * x_clamp 2025-05-07T20:32:35.7785566Z x0 = x[:, :D] 2025-05-07T20:32:35.7785646Z x1 = x[:, D:] 2025-05-07T20:32:35.7785758Z 2025-05-07T20:32:35.7785850Z if contiguous: 2025-05-07T20:32:35.7785939Z x0 = x0.contiguous() 2025-05-07T20:32:35.7786028Z x1 = x1.contiguous() 2025-05-07T20:32:35.7786096Z 2025-05-07T20:32:35.7786184Z if scale_ub is not None: 2025-05-07T20:32:35.7786291Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7786422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7786493Z ) 2025-05-07T20:32:35.7786571Z else: 2025-05-07T20:32:35.7786660Z scale_ub_tensor = None 2025-05-07T20:32:35.7786727Z 2025-05-07T20:32:35.7786862Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7786949Z op = silu_mul_quant 2025-05-07T20:32:35.7787031Z if compiled: 2025-05-07T20:32:35.7787129Z op = torch.compile(op) 2025-05-07T20:32:35.7787231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7787304Z 2025-05-07T20:32:35.7787391Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7787396Z 2025-05-07T20:32:35.7787490Z moe/activation_test.py:117: 2025-05-07T20:32:35.7787617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7787714Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7787808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7788307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.7788404Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7788765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7788984Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7789325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7789472Z kernel = self.compile( 2025-05-07T20:32:35.7790058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7790300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7790429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7790433Z 2025-05-07T20:32:35.7790632Z self = 2025-05-07T20:32:35.7791549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7792051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f1000>} 2025-05-07T20:32:35.7792795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7792982Z context = 2025-05-07T20:32:35.7792986Z 2025-05-07T20:32:35.7793145Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7793406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7793513Z module_map=module_map) 2025-05-07T20:32:35.7793673Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7793769Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7793840Z E ^ 2025-05-07T20:32:35.7794193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7794263Z 2025-05-07T20:32:35.7794677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7794682Z 2025-05-07T20:32:35.7794783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7795003Z self=, 2025-05-07T20:32:35.7795077Z T=4096, 2025-05-07T20:32:35.7795150Z D=5120, 2025-05-07T20:32:35.7795230Z scale_ub=1200.0, 2025-05-07T20:32:35.7795315Z contiguous=False, 2025-05-07T20:32:35.7795399Z compiled=True, 2025-05-07T20:32:35.7795465Z ) 2025-05-07T20:32:35.7795680Z self = 2025-05-07T20:32:35.7795855Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.7795860Z 2025-05-07T20:32:35.7795932Z @given( 2025-05-07T20:32:35.7796044Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7796149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7796260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7796374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7796483Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7796552Z ) 2025-05-07T20:32:35.7796793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7796886Z def test_silu_mul_quant( 2025-05-07T20:32:35.7796958Z self, 2025-05-07T20:32:35.7797033Z T: int, 2025-05-07T20:32:35.7797105Z D: int, 2025-05-07T20:32:35.7797200Z scale_ub: Optional[float], 2025-05-07T20:32:35.7797291Z contiguous: bool, 2025-05-07T20:32:35.7797370Z compiled: bool, 2025-05-07T20:32:35.7797446Z ) -> None: 2025-05-07T20:32:35.7797539Z torch.manual_seed(2025) 2025-05-07T20:32:35.7797606Z 2025-05-07T20:32:35.7797773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7797912Z 2025-05-07T20:32:35.7798000Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7798127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7798212Z x = x_sign * x_clamp 2025-05-07T20:32:35.7798289Z x0 = x[:, :D] 2025-05-07T20:32:35.7798365Z x1 = x[:, D:] 2025-05-07T20:32:35.7798430Z 2025-05-07T20:32:35.7798509Z if contiguous: 2025-05-07T20:32:35.7798601Z x0 = x0.contiguous() 2025-05-07T20:32:35.7798687Z x1 = x1.contiguous() 2025-05-07T20:32:35.7798796Z 2025-05-07T20:32:35.7798890Z if scale_ub is not None: 2025-05-07T20:32:35.7799033Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7799168Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7799243Z ) 2025-05-07T20:32:35.7799318Z else: 2025-05-07T20:32:35.7799412Z scale_ub_tensor = None 2025-05-07T20:32:35.7799482Z 2025-05-07T20:32:35.7799612Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7799702Z op = silu_mul_quant 2025-05-07T20:32:35.7799786Z if compiled: 2025-05-07T20:32:35.7799881Z op = torch.compile(op) 2025-05-07T20:32:35.7799989Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7800057Z 2025-05-07T20:32:35.7800145Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7800149Z 2025-05-07T20:32:35.7800248Z moe/activation_test.py:117: 2025-05-07T20:32:35.7800371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7800476Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7800575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7800933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7801025Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7801558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7801654Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7802007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7802222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7802563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7802655Z kernel = self.compile( 2025-05-07T20:32:35.7803030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7803206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7803326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7803333Z 2025-05-07T20:32:35.7803538Z self = 2025-05-07T20:32:35.7804307Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7804796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f0700>} 2025-05-07T20:32:35.7805535Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7805721Z context = 2025-05-07T20:32:35.7805726Z 2025-05-07T20:32:35.7805893Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7806202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7806306Z module_map=module_map) 2025-05-07T20:32:35.7806475Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7806569Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7806640Z E ^ 2025-05-07T20:32:35.7806990Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7807035Z 2025-05-07T20:32:35.7807509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7807514Z 2025-05-07T20:32:35.7807617Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7807832Z self=, 2025-05-07T20:32:35.7807902Z T=2048, 2025-05-07T20:32:35.7807977Z D=7168, 2025-05-07T20:32:35.7808057Z scale_ub=1200.0, 2025-05-07T20:32:35.7808140Z contiguous=False, 2025-05-07T20:32:35.7808223Z compiled=False, 2025-05-07T20:32:35.7808289Z ) 2025-05-07T20:32:35.7808502Z self = 2025-05-07T20:32:35.7808673Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.7808677Z 2025-05-07T20:32:35.7808748Z @given( 2025-05-07T20:32:35.7808868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7808966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7809076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7809193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7809305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7809378Z ) 2025-05-07T20:32:35.7809622Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7809760Z def test_silu_mul_quant( 2025-05-07T20:32:35.7809838Z self, 2025-05-07T20:32:35.7809910Z T: int, 2025-05-07T20:32:35.7809983Z D: int, 2025-05-07T20:32:35.7810082Z scale_ub: Optional[float], 2025-05-07T20:32:35.7810167Z contiguous: bool, 2025-05-07T20:32:35.7810249Z compiled: bool, 2025-05-07T20:32:35.7810326Z ) -> None: 2025-05-07T20:32:35.7810418Z torch.manual_seed(2025) 2025-05-07T20:32:35.7810486Z 2025-05-07T20:32:35.7810654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7810727Z 2025-05-07T20:32:35.7810814Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7810950Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7811051Z x = x_sign * x_clamp 2025-05-07T20:32:35.7811140Z x0 = x[:, :D] 2025-05-07T20:32:35.7811229Z x1 = x[:, D:] 2025-05-07T20:32:35.7811296Z 2025-05-07T20:32:35.7811377Z if contiguous: 2025-05-07T20:32:35.7811471Z x0 = x0.contiguous() 2025-05-07T20:32:35.7811559Z x1 = x1.contiguous() 2025-05-07T20:32:35.7811633Z 2025-05-07T20:32:35.7811719Z if scale_ub is not None: 2025-05-07T20:32:35.7811822Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7811955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7812027Z ) 2025-05-07T20:32:35.7812098Z else: 2025-05-07T20:32:35.7812194Z scale_ub_tensor = None 2025-05-07T20:32:35.7812261Z 2025-05-07T20:32:35.7812392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7812479Z op = silu_mul_quant 2025-05-07T20:32:35.7812564Z if compiled: 2025-05-07T20:32:35.7812666Z op = torch.compile(op) 2025-05-07T20:32:35.7812771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7812838Z 2025-05-07T20:32:35.7812930Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7812979Z 2025-05-07T20:32:35.7813076Z moe/activation_test.py:117: 2025-05-07T20:32:35.7813201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7813301Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7813397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7813888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.7813980Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7814332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7814636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7814980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7815071Z kernel = self.compile( 2025-05-07T20:32:35.7815463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7815635Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7815763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7815768Z 2025-05-07T20:32:35.7815966Z self = 2025-05-07T20:32:35.7816728Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7817238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f1240>} 2025-05-07T20:32:35.7818013Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7818206Z context = 2025-05-07T20:32:35.7818211Z 2025-05-07T20:32:35.7818369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7818628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7818730Z module_map=module_map) 2025-05-07T20:32:35.7818895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7818992Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7819069Z E ^ 2025-05-07T20:32:35.7819417Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7819421Z 2025-05-07T20:32:35.7819968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7819975Z 2025-05-07T20:32:35.7820076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7820294Z self=, 2025-05-07T20:32:35.7820370Z T=1, 2025-05-07T20:32:35.7820445Z D=7168, 2025-05-07T20:32:35.7820524Z scale_ub=None, 2025-05-07T20:32:35.7820605Z contiguous=True, 2025-05-07T20:32:35.7820686Z compiled=False, 2025-05-07T20:32:35.7820757Z ) 2025-05-07T20:32:35.7820966Z self = 2025-05-07T20:32:35.7821134Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.7821142Z 2025-05-07T20:32:35.7821215Z @given( 2025-05-07T20:32:35.7821330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7821428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7821540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7821705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7821820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7821889Z ) 2025-05-07T20:32:35.7822128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7822222Z def test_silu_mul_quant( 2025-05-07T20:32:35.7822293Z self, 2025-05-07T20:32:35.7822365Z T: int, 2025-05-07T20:32:35.7822444Z D: int, 2025-05-07T20:32:35.7822541Z scale_ub: Optional[float], 2025-05-07T20:32:35.7822672Z contiguous: bool, 2025-05-07T20:32:35.7822754Z compiled: bool, 2025-05-07T20:32:35.7822828Z ) -> None: 2025-05-07T20:32:35.7822963Z torch.manual_seed(2025) 2025-05-07T20:32:35.7823032Z 2025-05-07T20:32:35.7823197Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7823269Z 2025-05-07T20:32:35.7823356Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7823482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7823572Z x = x_sign * x_clamp 2025-05-07T20:32:35.7823644Z x0 = x[:, :D] 2025-05-07T20:32:35.7823720Z x1 = x[:, D:] 2025-05-07T20:32:35.7823794Z 2025-05-07T20:32:35.7823873Z if contiguous: 2025-05-07T20:32:35.7823963Z x0 = x0.contiguous() 2025-05-07T20:32:35.7824054Z x1 = x1.contiguous() 2025-05-07T20:32:35.7824120Z 2025-05-07T20:32:35.7824211Z if scale_ub is not None: 2025-05-07T20:32:35.7824318Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7824447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7824523Z ) 2025-05-07T20:32:35.7824596Z else: 2025-05-07T20:32:35.7824687Z scale_ub_tensor = None 2025-05-07T20:32:35.7824760Z 2025-05-07T20:32:35.7824885Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7824973Z op = silu_mul_quant 2025-05-07T20:32:35.7825103Z if compiled: 2025-05-07T20:32:35.7825204Z op = torch.compile(op) 2025-05-07T20:32:35.7825311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7825382Z 2025-05-07T20:32:35.7825470Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7825475Z 2025-05-07T20:32:35.7825570Z moe/activation_test.py:117: 2025-05-07T20:32:35.7825694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7825791Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7825891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7826382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7826477Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7826832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7827057Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7827400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7827494Z kernel = self.compile( 2025-05-07T20:32:35.7827872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7828044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7828166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7828173Z 2025-05-07T20:32:35.7828381Z self = 2025-05-07T20:32:35.7829144Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7832485Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f2050>} 2025-05-07T20:32:35.7833260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7833455Z context = 2025-05-07T20:32:35.7833528Z 2025-05-07T20:32:35.7833694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7833999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7834108Z module_map=module_map) 2025-05-07T20:32:35.7834266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7834368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7834452Z E ^ 2025-05-07T20:32:35.7834803Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7834808Z 2025-05-07T20:32:35.7835222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7835227Z 2025-05-07T20:32:35.7835326Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7835545Z self=, 2025-05-07T20:32:35.7835627Z T=16384, 2025-05-07T20:32:35.7835699Z D=7168, 2025-05-07T20:32:35.7835782Z scale_ub=1200.0, 2025-05-07T20:32:35.7835865Z contiguous=False, 2025-05-07T20:32:35.7835944Z compiled=True, 2025-05-07T20:32:35.7836014Z ) 2025-05-07T20:32:35.7836233Z self = 2025-05-07T20:32:35.7836450Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.7836455Z 2025-05-07T20:32:35.7836530Z @given( 2025-05-07T20:32:35.7836649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7836746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7836861Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7836974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7837085Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7837162Z ) 2025-05-07T20:32:35.7837412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7837508Z def test_silu_mul_quant( 2025-05-07T20:32:35.7837580Z self, 2025-05-07T20:32:35.7837650Z T: int, 2025-05-07T20:32:35.7837724Z D: int, 2025-05-07T20:32:35.7837820Z scale_ub: Optional[float], 2025-05-07T20:32:35.7837906Z contiguous: bool, 2025-05-07T20:32:35.7837993Z compiled: bool, 2025-05-07T20:32:35.7838074Z ) -> None: 2025-05-07T20:32:35.7838168Z torch.manual_seed(2025) 2025-05-07T20:32:35.7838241Z 2025-05-07T20:32:35.7838407Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7838475Z 2025-05-07T20:32:35.7838566Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7838688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7838776Z x = x_sign * x_clamp 2025-05-07T20:32:35.7838854Z x0 = x[:, :D] 2025-05-07T20:32:35.7838933Z x1 = x[:, D:] 2025-05-07T20:32:35.7839007Z 2025-05-07T20:32:35.7839087Z if contiguous: 2025-05-07T20:32:35.7839178Z x0 = x0.contiguous() 2025-05-07T20:32:35.7839266Z x1 = x1.contiguous() 2025-05-07T20:32:35.7839332Z 2025-05-07T20:32:35.7839418Z if scale_ub is not None: 2025-05-07T20:32:35.7839527Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7839665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7839782Z ) 2025-05-07T20:32:35.7839856Z else: 2025-05-07T20:32:35.7839947Z scale_ub_tensor = None 2025-05-07T20:32:35.7840013Z 2025-05-07T20:32:35.7840145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7840232Z op = silu_mul_quant 2025-05-07T20:32:35.7840317Z if compiled: 2025-05-07T20:32:35.7840415Z op = torch.compile(op) 2025-05-07T20:32:35.7840517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7840655Z 2025-05-07T20:32:35.7840745Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7840750Z 2025-05-07T20:32:35.7840972Z moe/activation_test.py:117: 2025-05-07T20:32:35.7841122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7841243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7841338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7841716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7841810Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7842299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7842395Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7842748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7842973Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7843311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7843406Z kernel = self.compile( 2025-05-07T20:32:35.7843782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7843999Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7844128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7844133Z 2025-05-07T20:32:35.7844337Z self = 2025-05-07T20:32:35.7845101Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7845601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f3490>} 2025-05-07T20:32:35.7846334Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7846532Z context = 2025-05-07T20:32:35.7846537Z 2025-05-07T20:32:35.7846702Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7846963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7847067Z module_map=module_map) 2025-05-07T20:32:35.7847228Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7847326Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7847401Z E ^ 2025-05-07T20:32:35.7847750Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7847754Z 2025-05-07T20:32:35.7848170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7848175Z 2025-05-07T20:32:35.7848318Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7848540Z self=, 2025-05-07T20:32:35.7848613Z T=1, 2025-05-07T20:32:35.7848683Z D=7168, 2025-05-07T20:32:35.7848764Z scale_ub=None, 2025-05-07T20:32:35.7848846Z contiguous=False, 2025-05-07T20:32:35.7848927Z compiled=False, 2025-05-07T20:32:35.7848998Z ) 2025-05-07T20:32:35.7849212Z self = 2025-05-07T20:32:35.7849377Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.7849424Z 2025-05-07T20:32:35.7849498Z @given( 2025-05-07T20:32:35.7849652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7849753Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7849869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7849985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7850108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7850180Z ) 2025-05-07T20:32:35.7850420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7850513Z def test_silu_mul_quant( 2025-05-07T20:32:35.7850584Z self, 2025-05-07T20:32:35.7850659Z T: int, 2025-05-07T20:32:35.7850733Z D: int, 2025-05-07T20:32:35.7850832Z scale_ub: Optional[float], 2025-05-07T20:32:35.7850924Z contiguous: bool, 2025-05-07T20:32:35.7851005Z compiled: bool, 2025-05-07T20:32:35.7851084Z ) -> None: 2025-05-07T20:32:35.7851179Z torch.manual_seed(2025) 2025-05-07T20:32:35.7851247Z 2025-05-07T20:32:35.7851416Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7851490Z 2025-05-07T20:32:35.7851579Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7851702Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7851788Z x = x_sign * x_clamp 2025-05-07T20:32:35.7851910Z x0 = x[:, :D] 2025-05-07T20:32:35.7851989Z x1 = x[:, D:] 2025-05-07T20:32:35.7852061Z 2025-05-07T20:32:35.7852142Z if contiguous: 2025-05-07T20:32:35.7852235Z x0 = x0.contiguous() 2025-05-07T20:32:35.7852323Z x1 = x1.contiguous() 2025-05-07T20:32:35.7852389Z 2025-05-07T20:32:35.7852479Z if scale_ub is not None: 2025-05-07T20:32:35.7852582Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7852715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7852796Z ) 2025-05-07T20:32:35.7852866Z else: 2025-05-07T20:32:35.7852957Z scale_ub_tensor = None 2025-05-07T20:32:35.7853031Z 2025-05-07T20:32:35.7853168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7853253Z op = silu_mul_quant 2025-05-07T20:32:35.7853335Z if compiled: 2025-05-07T20:32:35.7853436Z op = torch.compile(op) 2025-05-07T20:32:35.7853547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7853613Z 2025-05-07T20:32:35.7853707Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7853711Z 2025-05-07T20:32:35.7853805Z moe/activation_test.py:117: 2025-05-07T20:32:35.7853933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7854030Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7854124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7854615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7854712Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7855063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7855281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7855670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7855765Z kernel = self.compile( 2025-05-07T20:32:35.7856139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7856311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7856436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7856440Z 2025-05-07T20:32:35.7856686Z self = 2025-05-07T20:32:35.7857493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7857996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd0960f37f0>} 2025-05-07T20:32:35.7858734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7858925Z context = 2025-05-07T20:32:35.7858929Z 2025-05-07T20:32:35.7859094Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7859362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7859468Z module_map=module_map) 2025-05-07T20:32:35.7859629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7859728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7859934Z E ^ 2025-05-07T20:32:35.7860323Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7860335Z 2025-05-07T20:32:35.7860749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7860754Z 2025-05-07T20:32:35.7860854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7861073Z self=, 2025-05-07T20:32:35.7861145Z T=2048, 2025-05-07T20:32:35.7861215Z D=7168, 2025-05-07T20:32:35.7861300Z scale_ub=None, 2025-05-07T20:32:35.7861382Z contiguous=False, 2025-05-07T20:32:35.7861461Z compiled=True, 2025-05-07T20:32:35.7861538Z ) 2025-05-07T20:32:35.7861748Z self = 2025-05-07T20:32:35.7861922Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7861926Z 2025-05-07T20:32:35.7861999Z @given( 2025-05-07T20:32:35.7862115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7862216Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7862328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7862444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7862555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7862624Z ) 2025-05-07T20:32:35.7862874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7862964Z def test_silu_mul_quant( 2025-05-07T20:32:35.7863038Z self, 2025-05-07T20:32:35.7863112Z T: int, 2025-05-07T20:32:35.7863185Z D: int, 2025-05-07T20:32:35.7863282Z scale_ub: Optional[float], 2025-05-07T20:32:35.7863371Z contiguous: bool, 2025-05-07T20:32:35.7863453Z compiled: bool, 2025-05-07T20:32:35.7863524Z ) -> None: 2025-05-07T20:32:35.7863618Z torch.manual_seed(2025) 2025-05-07T20:32:35.7863733Z 2025-05-07T20:32:35.7863902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7863974Z 2025-05-07T20:32:35.7864064Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7864184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7864273Z x = x_sign * x_clamp 2025-05-07T20:32:35.7864350Z x0 = x[:, :D] 2025-05-07T20:32:35.7864428Z x1 = x[:, D:] 2025-05-07T20:32:35.7864496Z 2025-05-07T20:32:35.7864576Z if contiguous: 2025-05-07T20:32:35.7864711Z x0 = x0.contiguous() 2025-05-07T20:32:35.7864795Z x1 = x1.contiguous() 2025-05-07T20:32:35.7864862Z 2025-05-07T20:32:35.7864993Z if scale_ub is not None: 2025-05-07T20:32:35.7865096Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7865227Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7865303Z ) 2025-05-07T20:32:35.7865375Z else: 2025-05-07T20:32:35.7865472Z scale_ub_tensor = None 2025-05-07T20:32:35.7865544Z 2025-05-07T20:32:35.7865672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7865760Z op = silu_mul_quant 2025-05-07T20:32:35.7865843Z if compiled: 2025-05-07T20:32:35.7865941Z op = torch.compile(op) 2025-05-07T20:32:35.7866045Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7866114Z 2025-05-07T20:32:35.7866202Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7866206Z 2025-05-07T20:32:35.7866306Z moe/activation_test.py:117: 2025-05-07T20:32:35.7866431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7866531Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7866629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7866989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7867086Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7867621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7867716Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7868076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7868292Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7868625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7868722Z kernel = self.compile( 2025-05-07T20:32:35.7869099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7869273Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7869398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7869405Z 2025-05-07T20:32:35.7869608Z self = 2025-05-07T20:32:35.7870374Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7870876Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1b50af0>} 2025-05-07T20:32:35.7871675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7871863Z context = 2025-05-07T20:32:35.7871934Z 2025-05-07T20:32:35.7872104Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7872359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7872463Z module_map=module_map) 2025-05-07T20:32:35.7872625Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7872720Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7872791Z E ^ 2025-05-07T20:32:35.7873140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7873185Z 2025-05-07T20:32:35.7873637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7873642Z 2025-05-07T20:32:35.7873749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7873967Z self=, 2025-05-07T20:32:35.7874042Z T=4096, 2025-05-07T20:32:35.7874116Z D=7168, 2025-05-07T20:32:35.7874191Z scale_ub=None, 2025-05-07T20:32:35.7874274Z contiguous=False, 2025-05-07T20:32:35.7874355Z compiled=True, 2025-05-07T20:32:35.7874421Z ) 2025-05-07T20:32:35.7874630Z self = 2025-05-07T20:32:35.7874804Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.7874808Z 2025-05-07T20:32:35.7874881Z @given( 2025-05-07T20:32:35.7874999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7875095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7875209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7875329Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7875437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7875506Z ) 2025-05-07T20:32:35.7875791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7875887Z def test_silu_mul_quant( 2025-05-07T20:32:35.7875959Z self, 2025-05-07T20:32:35.7876030Z T: int, 2025-05-07T20:32:35.7876103Z D: int, 2025-05-07T20:32:35.7876202Z scale_ub: Optional[float], 2025-05-07T20:32:35.7876287Z contiguous: bool, 2025-05-07T20:32:35.7876368Z compiled: bool, 2025-05-07T20:32:35.7876444Z ) -> None: 2025-05-07T20:32:35.7876535Z torch.manual_seed(2025) 2025-05-07T20:32:35.7876601Z 2025-05-07T20:32:35.7876773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7876842Z 2025-05-07T20:32:35.7876930Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7877054Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7877139Z x = x_sign * x_clamp 2025-05-07T20:32:35.7877216Z x0 = x[:, :D] 2025-05-07T20:32:35.7877296Z x1 = x[:, D:] 2025-05-07T20:32:35.7877365Z 2025-05-07T20:32:35.7877449Z if contiguous: 2025-05-07T20:32:35.7877536Z x0 = x0.contiguous() 2025-05-07T20:32:35.7877621Z x1 = x1.contiguous() 2025-05-07T20:32:35.7877689Z 2025-05-07T20:32:35.7877775Z if scale_ub is not None: 2025-05-07T20:32:35.7877877Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7878010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7878080Z ) 2025-05-07T20:32:35.7878150Z else: 2025-05-07T20:32:35.7878241Z scale_ub_tensor = None 2025-05-07T20:32:35.7878317Z 2025-05-07T20:32:35.7878444Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7878536Z op = silu_mul_quant 2025-05-07T20:32:35.7878619Z if compiled: 2025-05-07T20:32:35.7878719Z op = torch.compile(op) 2025-05-07T20:32:35.7878822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7878889Z 2025-05-07T20:32:35.7879029Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7879033Z 2025-05-07T20:32:35.7879129Z moe/activation_test.py:117: 2025-05-07T20:32:35.7879254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7879355Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7879451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7879818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7879910Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7880484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.7880582Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.7880938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.7881161Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.7881505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.7881595Z     kernel = self.compile(
2025-05-07T20:32:35.7881978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.7882149Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.7882269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7882277Z 
2025-05-07T20:32:35.7882485Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:35.7883248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.7883786Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fcec1b50280>}
2025-05-07T20:32:35.7884534Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:35.7884722Z context = <...>
2025-05-07T20:32:35.7884726Z 
2025-05-07T20:32:35.7884894Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.7885152Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.7885257Z                            module_map=module_map)
2025-05-07T20:32:35.7885417Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7885509Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.7885588Z E       ^
2025-05-07T20:32:35.7885936Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7885940Z 
2025-05-07T20:32:35.7886354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7886358Z 
2025-05-07T20:32:35.7886456Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.7886677Z     self=<...>,
2025-05-07T20:32:35.7886757Z     T=16384,
2025-05-07T20:32:35.7886829Z     D=5120,
2025-05-07T20:32:35.7886909Z     scale_ub=1200.0,
2025-05-07T20:32:35.7886994Z     contiguous=False,
2025-05-07T20:32:35.7887074Z     compiled=False,
2025-05-07T20:32:35.7887142Z )
2025-05-07T20:32:35.7887356Z self = <...>
2025-05-07T20:32:35.7887531Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:35.7887583Z 
2025-05-07T20:32:35.7887661Z     @given(
2025-05-07T20:32:35.7887777Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:35.7887872Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:35.7887987Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:35.7888101Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:35.7888214Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:35.7888290Z     )
2025-05-07T20:32:35.7888537Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:35.7888667Z     def test_silu_mul_quant(
2025-05-07T20:32:35.7888740Z         self,
2025-05-07T20:32:35.7888849Z         T: int,
2025-05-07T20:32:35.7888923Z         D: int,
2025-05-07T20:32:35.7889019Z         scale_ub: Optional[float],
2025-05-07T20:32:35.7889105Z         contiguous: bool,
2025-05-07T20:32:35.7889190Z         compiled: bool,
2025-05-07T20:32:35.7889263Z     ) -> None:
2025-05-07T20:32:35.7889361Z         torch.manual_seed(2025)
2025-05-07T20:32:35.7889432Z 
2025-05-07T20:32:35.7889598Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.7889664Z 
2025-05-07T20:32:35.7889754Z         x_sign = torch.sign(x)
2025-05-07T20:32:35.7890494Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:35.7890625Z         x = x_sign * x_clamp
2025-05-07T20:32:35.7890718Z         x0 = x[:, :D]
2025-05-07T20:32:35.7890798Z         x1 = x[:, D:]
2025-05-07T20:32:35.7890880Z 
2025-05-07T20:32:35.7890974Z         if contiguous:
2025-05-07T20:32:35.7891064Z             x0 = x0.contiguous()
2025-05-07T20:32:35.7891159Z             x1 = x1.contiguous()
2025-05-07T20:32:35.7891229Z 
2025-05-07T20:32:35.7891319Z         if scale_ub is not None:
2025-05-07T20:32:35.7891430Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:35.7891567Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:35.7891648Z             )
2025-05-07T20:32:35.7891866Z         else:
2025-05-07T20:32:35.7891965Z             scale_ub_tensor = None
2025-05-07T20:32:35.7892039Z 
2025-05-07T20:32:35.7892173Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:35.7892262Z             op = silu_mul_quant
2025-05-07T20:32:35.7892349Z             if compiled:
2025-05-07T20:32:35.7892448Z                 op = torch.compile(op)
2025-05-07T20:32:35.7892555Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.7892629Z 
2025-05-07T20:32:35.7892720Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:35.7892729Z 
2025-05-07T20:32:35.7892826Z moe/activation_test.py:117: 
2025-05-07T20:32:35.7892960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7893059Z moe/activation_test.py:115: in fn
2025-05-07T20:32:35.7893164Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.7893669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.7893767Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.7894130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.7894349Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.7894690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.7894781Z     kernel = self.compile(
2025-05-07T20:32:35.7895169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.7895347Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.7895472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7895476Z 
2025-05-07T20:32:35.7895683Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:35.7896519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.7897017Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fcec1b52d40>}
2025-05-07T20:32:35.7897828Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:35.7898089Z context = <...>
2025-05-07T20:32:35.7898094Z 
2025-05-07T20:32:35.7898261Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.7898526Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.7898631Z                            module_map=module_map)
2025-05-07T20:32:35.7898793Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7898888Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.7898963Z E       ^
2025-05-07T20:32:35.7899317Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7899322Z 
2025-05-07T20:32:35.7899740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
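Every sampled combination of (T, D, scale_ub, contiguous, compiled) fails with this same compile-time error, so the failure tracks the GPU architecture rather than the inputs: Triton's fp8e4nv type (torch.float8_e4m3fn, the format this kernel quantizes to) is generally lowered only on NVIDIA GPUs with compute capability 8.9 or newer, while older parts expose just fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability gate such a test could use; the helper name and decorator placement are assumptions, not FBGEMM's actual API:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on NVIDIA
        # GPUs with compute capability >= 8.9 (Ada/Hopper); older architectures
        # offer just fp8e4b15/fp8e5, matching the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical gating of the failing suite:
    @unittest.skipIf(not _supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...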
2025-05-07T20:32:35.7900033Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.7913037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7913142Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.7925898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7926005Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.7938822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7938933Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7953299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7956400Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.7969518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7969624Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7982689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7982794Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.7996023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.7996131Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:35.8008767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
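From how the test drives it, silu_mul_quant(x0, x1, scale_ub_tensor) appears to fuse SiLU(x0) * x1 with rowwise fp8 quantization and return the quantized tensor plus per-row scales. A plain-PyTorch reference of those assumed semantics; the function name, scaling rule, and scale_ub clamp below are guesses, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: fused SiLU(x0) * x1, then rowwise fp8 quantization,
        # with scale_ub as an optional upper bound on the per-row maximum used
        # to derive the scale.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantizing with y_fp8.float() * y_scale should then approximate SiLU(x0) * x1, which is presumably what the test asserts after the failing call.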
2025-05-07T20:32:35.8008877Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.8021435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.8021601Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.8034649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.8034762Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.8046777Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.8046876Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.8046954Z E       ^
2025-05-07T20:32:35.8047309Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8047317Z 2025-05-07T20:32:35.8047741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8047745Z 2025-05-07T20:32:35.8047847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8048069Z self=, 2025-05-07T20:32:35.8048144Z T=2048, 2025-05-07T20:32:35.8048218Z D=7168, 2025-05-07T20:32:35.8048300Z scale_ub=None, 2025-05-07T20:32:35.8048383Z contiguous=True, 2025-05-07T20:32:35.8048468Z compiled=True, 2025-05-07T20:32:35.8048541Z ) 2025-05-07T20:32:35.8048760Z self = 2025-05-07T20:32:35.8048931Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.8048938Z 2025-05-07T20:32:35.8049012Z @given( 2025-05-07T20:32:35.8049129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8049271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8049388Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8049507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8049624Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8049696Z ) 2025-05-07T20:32:35.8049939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8050036Z def test_silu_mul_quant( 2025-05-07T20:32:35.8050112Z self, 2025-05-07T20:32:35.8050192Z T: int, 2025-05-07T20:32:35.8050270Z D: int, 2025-05-07T20:32:35.8050367Z scale_ub: Optional[float], 2025-05-07T20:32:35.8050466Z contiguous: bool, 2025-05-07T20:32:35.8050552Z compiled: bool, 2025-05-07T20:32:35.8050629Z ) -> None: 2025-05-07T20:32:35.8050724Z torch.manual_seed(2025) 2025-05-07T20:32:35.8050795Z 2025-05-07T20:32:35.8050960Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8051038Z 2025-05-07T20:32:35.8051129Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8051250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8051340Z x = x_sign * x_clamp 2025-05-07T20:32:35.8051418Z x0 = x[:, :D] 2025-05-07T20:32:35.8051495Z x1 = x[:, D:] 2025-05-07T20:32:35.8051569Z 2025-05-07T20:32:35.8051650Z if contiguous: 2025-05-07T20:32:35.8051744Z x0 = x0.contiguous() 2025-05-07T20:32:35.8051832Z x1 = x1.contiguous() 2025-05-07T20:32:35.8051905Z 2025-05-07T20:32:35.8051996Z if scale_ub is not None: 2025-05-07T20:32:35.8052101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8052236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8052316Z ) 2025-05-07T20:32:35.8052391Z else: 2025-05-07T20:32:35.8052486Z scale_ub_tensor = None 2025-05-07T20:32:35.8052561Z 2025-05-07T20:32:35.8052738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8052829Z op = silu_mul_quant 2025-05-07T20:32:35.8052919Z if compiled: 2025-05-07T20:32:35.8053018Z op = torch.compile(op) 2025-05-07T20:32:35.8053125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8053196Z 2025-05-07T20:32:35.8053287Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8053292Z 2025-05-07T20:32:35.8053394Z moe/activation_test.py:117: 2025-05-07T20:32:35.8053521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8053665Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8053801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8054164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.8054256Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.8054752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8054850Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8055211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8055435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8055779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8055882Z kernel = self.compile( 2025-05-07T20:32:35.8056271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8056448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8056574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8056579Z 2025-05-07T20:32:35.8056830Z self = 2025-05-07T20:32:35.8057615Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8058120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec13a6560>} 2025-05-07T20:32:35.8058870Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8059059Z context = 2025-05-07T20:32:35.8059063Z 2025-05-07T20:32:35.8059226Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8059498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8059604Z module_map=module_map) 2025-05-07T20:32:35.8059833Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8059956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8060031Z E ^ 2025-05-07T20:32:35.8060387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8060391Z 2025-05-07T20:32:35.8060800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8060807Z 2025-05-07T20:32:35.8060915Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8061161Z self=, 2025-05-07T20:32:35.8061251Z T=16384, 2025-05-07T20:32:35.8061332Z D=5120, 2025-05-07T20:32:35.8061464Z scale_ub=None, 2025-05-07T20:32:35.8061551Z contiguous=False, 2025-05-07T20:32:35.8061638Z compiled=False, 2025-05-07T20:32:35.8061706Z ) 2025-05-07T20:32:35.8061916Z self = 2025-05-07T20:32:35.8062095Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.8062100Z 2025-05-07T20:32:35.8062170Z @given( 2025-05-07T20:32:35.8062290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8062387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8062542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8062698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8062811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8062882Z ) 2025-05-07T20:32:35.8063129Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8063222Z def test_silu_mul_quant( 2025-05-07T20:32:35.8063296Z self, 2025-05-07T20:32:35.8063370Z T: int, 2025-05-07T20:32:35.8063443Z D: int, 2025-05-07T20:32:35.8063539Z scale_ub: Optional[float], 2025-05-07T20:32:35.8063629Z contiguous: bool, 2025-05-07T20:32:35.8063709Z compiled: bool, 2025-05-07T20:32:35.8063785Z ) -> None: 2025-05-07T20:32:35.8063879Z torch.manual_seed(2025) 2025-05-07T20:32:35.8063948Z 2025-05-07T20:32:35.8064118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8064191Z 2025-05-07T20:32:35.8064280Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8064407Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8066239Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
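[annotation] Two distinct failures repeat through this run: the Triton CompilationError seen in the blocks above, and, beginning here, CUDA OOMs. The CompilationError is raised while lowering `_fbgemm_silu_mul_quant`: the kernel produces an fp8e4nv (E4M3) output, which Triton's NVIDIA backend only enables on compute capability 8.9 and newer. The g5.4xlarge runner's A10G reports capability 8.6, where only fp8e4b15 and fp8e5 exist, exactly as the error lists. A minimal guard, sketched here rather than taken from the repo (function and class names are illustrative), would skip the fp8 path on such GPUs:

# Sketch only, not from moe/activation_test.py: skip fp8 tests where the
# Triton backend lacks fp8e4nv (the type torch.float8_e4m3fn lowers to).
import unittest
import torch

def _has_fp8e4nv() -> bool:
    # fp8e4nv needs compute capability >= (8, 9) (Ada/Hopper);
    # the A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _has_fp8e4nv(), "fp8e4nv unsupported on this architecture")
class SiluMulQuantFP8Test(unittest.TestCase):  # illustrative name
    ...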
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8066249Z 2025-05-07T20:32:35.8066369Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8066374Z 2025-05-07T20:32:35.8066471Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8066690Z self=, 2025-05-07T20:32:35.8066765Z T=4096, 2025-05-07T20:32:35.8066838Z D=7168, 2025-05-07T20:32:35.8066922Z scale_ub=1200.0, 2025-05-07T20:32:35.8067002Z contiguous=True, 2025-05-07T20:32:35.8067081Z compiled=True, 2025-05-07T20:32:35.8067155Z ) 2025-05-07T20:32:35.8067364Z self = 2025-05-07T20:32:35.8067536Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.8067541Z 2025-05-07T20:32:35.8067611Z @given( 2025-05-07T20:32:35.8067723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8067819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8067936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8068050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8068163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8068236Z ) 2025-05-07T20:32:35.8068482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8068577Z def test_silu_mul_quant( 2025-05-07T20:32:35.8068652Z self, 2025-05-07T20:32:35.8068724Z T: int, 2025-05-07T20:32:35.8068797Z D: int, 2025-05-07T20:32:35.8068892Z scale_ub: Optional[float], 2025-05-07T20:32:35.8068979Z contiguous: bool, 2025-05-07T20:32:35.8069110Z compiled: bool, 2025-05-07T20:32:35.8069185Z ) -> None: 2025-05-07T20:32:35.8069278Z torch.manual_seed(2025) 2025-05-07T20:32:35.8069350Z 2025-05-07T20:32:35.8069513Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8069590Z 2025-05-07T20:32:35.8069681Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8069803Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8071620Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8071668Z 2025-05-07T20:32:35.8071786Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8071791Z 2025-05-07T20:32:35.8071892Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8072109Z self=, 2025-05-07T20:32:35.8072182Z T=16384, 2025-05-07T20:32:35.8072258Z D=7168, 2025-05-07T20:32:35.8072336Z scale_ub=None, 2025-05-07T20:32:35.8072420Z contiguous=False, 2025-05-07T20:32:35.8072508Z compiled=False, 2025-05-07T20:32:35.8072587Z ) 2025-05-07T20:32:35.8072801Z self = 2025-05-07T20:32:35.8072976Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.8072981Z 2025-05-07T20:32:35.8073052Z @given( 2025-05-07T20:32:35.8073170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8073264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8073419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8073536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8073647Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8073715Z ) 2025-05-07T20:32:35.8073965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8074055Z def test_silu_mul_quant( 2025-05-07T20:32:35.8074135Z self, 2025-05-07T20:32:35.8074206Z T: int, 2025-05-07T20:32:35.8074282Z D: int, 2025-05-07T20:32:35.8074379Z scale_ub: Optional[float], 2025-05-07T20:32:35.8074466Z contiguous: bool, 2025-05-07T20:32:35.8074551Z compiled: bool, 2025-05-07T20:32:35.8077444Z ) -> None: 2025-05-07T20:32:35.8077558Z torch.manual_seed(2025) 2025-05-07T20:32:35.8077634Z 2025-05-07T20:32:35.8077805Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8079621Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8079634Z 2025-05-07T20:32:35.8079754Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8079759Z 2025-05-07T20:32:35.8079865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8080090Z self=, 2025-05-07T20:32:35.8080166Z T=2048, 2025-05-07T20:32:35.8080247Z D=7168, 2025-05-07T20:32:35.8080329Z scale_ub=1200.0, 2025-05-07T20:32:35.8080502Z contiguous=True, 2025-05-07T20:32:35.8080588Z compiled=True, 2025-05-07T20:32:35.8080660Z ) 2025-05-07T20:32:35.8080873Z self = 2025-05-07T20:32:35.8081043Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.8081048Z 2025-05-07T20:32:35.8081123Z @given( 2025-05-07T20:32:35.8081239Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8081342Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8081499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8081619Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8081771Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8081846Z ) 2025-05-07T20:32:35.8082097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8082189Z def test_silu_mul_quant( 2025-05-07T20:32:35.8082268Z self, 2025-05-07T20:32:35.8082352Z T: int, 2025-05-07T20:32:35.8082429Z D: int, 2025-05-07T20:32:35.8082526Z scale_ub: Optional[float], 2025-05-07T20:32:35.8082617Z contiguous: bool, 2025-05-07T20:32:35.8082702Z compiled: bool, 2025-05-07T20:32:35.8082781Z ) -> None: 2025-05-07T20:32:35.8082878Z torch.manual_seed(2025) 2025-05-07T20:32:35.8082949Z 2025-05-07T20:32:35.8083114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8083188Z 2025-05-07T20:32:35.8083285Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8083412Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8085217Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8085228Z 2025-05-07T20:32:35.8085351Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8085356Z 2025-05-07T20:32:35.8085457Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8085676Z self=, 2025-05-07T20:32:35.8085760Z T=2048, 2025-05-07T20:32:35.8085835Z D=7168, 2025-05-07T20:32:35.8085917Z scale_ub=None, 2025-05-07T20:32:35.8086012Z contiguous=True, 2025-05-07T20:32:35.8086096Z compiled=False, 2025-05-07T20:32:35.8086167Z ) 2025-05-07T20:32:35.8086383Z self = 2025-05-07T20:32:35.8086553Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8086560Z 2025-05-07T20:32:35.8086636Z @given( 2025-05-07T20:32:35.8086751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8086858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8086974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8087094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8087207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8087286Z ) 2025-05-07T20:32:35.8087527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8087630Z def test_silu_mul_quant( 2025-05-07T20:32:35.8087708Z self, 2025-05-07T20:32:35.8087785Z T: int, 2025-05-07T20:32:35.8087862Z D: int, 2025-05-07T20:32:35.8087963Z scale_ub: Optional[float], 2025-05-07T20:32:35.8088050Z contiguous: bool, 2025-05-07T20:32:35.8088140Z compiled: bool, 2025-05-07T20:32:35.8088264Z ) -> None: 2025-05-07T20:32:35.8088362Z torch.manual_seed(2025) 2025-05-07T20:32:35.8088441Z 2025-05-07T20:32:35.8088606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8088684Z 2025-05-07T20:32:35.8088782Z > x_sign = torch.sign(x) 2025-05-07T20:32:35.8090882Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
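[annotation] Note the allocator statistics across these OOM examples: each failed request is small (40 to 448 MiB), but the process is already holding roughly 21.6 to 21.7 GiB of the A10G's 22.07 GiB, so once the first large example fails, later examples fail at whichever statement first needs a fresh block (activation_test.py:92, :94, or :95). The error text itself suggests expandable segments; that knob must be in the environment before CUDA initializes, roughly as below (a sketch of the documented setting, not the workflow's actual configuration):

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA context is created, so
# export it in the job environment or set it before importing torch:
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
import torch  # import only after the allocator config is in place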
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8090945Z 2025-05-07T20:32:35.8091093Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:35.8091105Z 2025-05-07T20:32:35.8091217Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8091464Z self=, 2025-05-07T20:32:35.8091541Z T=1, 2025-05-07T20:32:35.8091615Z D=7168, 2025-05-07T20:32:35.8091700Z scale_ub=1200.0, 2025-05-07T20:32:35.8091793Z contiguous=True, 2025-05-07T20:32:35.8091877Z compiled=False, 2025-05-07T20:32:35.8091953Z ) 2025-05-07T20:32:35.8092169Z self = 2025-05-07T20:32:35.8092335Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8092340Z 2025-05-07T20:32:35.8092423Z @given( 2025-05-07T20:32:35.8092539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8092640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8092754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8092930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8093056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8093129Z ) 2025-05-07T20:32:35.8093371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8093470Z def test_silu_mul_quant( 2025-05-07T20:32:35.8093545Z self, 2025-05-07T20:32:35.8093621Z T: int, 2025-05-07T20:32:35.8093698Z D: int, 2025-05-07T20:32:35.8093795Z scale_ub: Optional[float], 2025-05-07T20:32:35.8093885Z contiguous: bool, 2025-05-07T20:32:35.8093984Z compiled: bool, 2025-05-07T20:32:35.8094063Z ) -> None: 2025-05-07T20:32:35.8094163Z torch.manual_seed(2025) 2025-05-07T20:32:35.8094237Z 2025-05-07T20:32:35.8094407Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8094483Z 2025-05-07T20:32:35.8094575Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8094699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8094803Z x = x_sign * x_clamp 2025-05-07T20:32:35.8094889Z x0 = x[:, :D] 2025-05-07T20:32:35.8094968Z x1 = x[:, D:] 2025-05-07T20:32:35.8095042Z 2025-05-07T20:32:35.8095127Z if contiguous: 2025-05-07T20:32:35.8095217Z x0 = x0.contiguous() 2025-05-07T20:32:35.8095307Z x1 = x1.contiguous() 2025-05-07T20:32:35.8095377Z 2025-05-07T20:32:35.8095468Z if scale_ub is not None: 2025-05-07T20:32:35.8095581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8095720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8095802Z ) 2025-05-07T20:32:35.8095880Z else: 2025-05-07T20:32:35.8095975Z scale_ub_tensor = None 2025-05-07T20:32:35.8096055Z 2025-05-07T20:32:35.8096184Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8096271Z op = silu_mul_quant 2025-05-07T20:32:35.8096426Z if compiled: 2025-05-07T20:32:35.8096531Z op = torch.compile(op) 2025-05-07T20:32:35.8096637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8096710Z 2025-05-07T20:32:35.8096799Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8096803Z 2025-05-07T20:32:35.8096902Z moe/activation_test.py:117: 2025-05-07T20:32:35.8097030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8097128Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8097228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8097819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8097919Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8098282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8098510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8098866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8098962Z kernel = self.compile( 2025-05-07T20:32:35.8099340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8099513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8099642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8099650Z 2025-05-07T20:32:35.8099936Z self = 2025-05-07T20:32:35.8100711Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8101265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec0f644c0>} 2025-05-07T20:32:35.8102018Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8102209Z context = 2025-05-07T20:32:35.8102214Z 2025-05-07T20:32:35.8102379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8102652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8102755Z module_map=module_map) 2025-05-07T20:32:35.8102914Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8103013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8103090Z E ^ 2025-05-07T20:32:35.8103440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8103444Z 2025-05-07T20:32:35.8103858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8103863Z 2025-05-07T20:32:35.8103962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8104180Z self=, 2025-05-07T20:32:35.8104259Z T=128, 2025-05-07T20:32:35.8104333Z D=5120, 2025-05-07T20:32:35.8104419Z scale_ub=None, 2025-05-07T20:32:35.8104500Z contiguous=True, 2025-05-07T20:32:35.8104580Z compiled=False, 2025-05-07T20:32:35.8104653Z ) 2025-05-07T20:32:35.8104867Z self = 2025-05-07T20:32:35.8105032Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8105085Z 2025-05-07T20:32:35.8105159Z @given( 2025-05-07T20:32:35.8105274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8105373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8105485Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8105597Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8105715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8105785Z ) 2025-05-07T20:32:35.8106024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8106162Z def test_silu_mul_quant( 2025-05-07T20:32:35.8106238Z self, 2025-05-07T20:32:35.8106349Z T: int, 2025-05-07T20:32:35.8106424Z D: int, 2025-05-07T20:32:35.8106520Z scale_ub: Optional[float], 2025-05-07T20:32:35.8106607Z contiguous: bool, 2025-05-07T20:32:35.8106690Z compiled: bool, 2025-05-07T20:32:35.8106763Z ) -> None: 2025-05-07T20:32:35.8106865Z torch.manual_seed(2025) 2025-05-07T20:32:35.8106936Z 2025-05-07T20:32:35.8107099Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8107172Z 2025-05-07T20:32:35.8107260Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8107380Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8107471Z x = x_sign * x_clamp 2025-05-07T20:32:35.8107547Z x0 = x[:, :D] 2025-05-07T20:32:35.8107623Z x1 = x[:, D:] 2025-05-07T20:32:35.8107696Z 2025-05-07T20:32:35.8107781Z if contiguous: 2025-05-07T20:32:35.8107874Z x0 = x0.contiguous() 2025-05-07T20:32:35.8107963Z x1 = x1.contiguous() 2025-05-07T20:32:35.8108033Z 2025-05-07T20:32:35.8108124Z if scale_ub is not None: 2025-05-07T20:32:35.8108228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8108358Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8108434Z ) 2025-05-07T20:32:35.8108577Z else: 2025-05-07T20:32:35.8108669Z scale_ub_tensor = None 2025-05-07T20:32:35.8108744Z 2025-05-07T20:32:35.8108870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8108957Z op = silu_mul_quant 2025-05-07T20:32:35.8109045Z if compiled: 2025-05-07T20:32:35.8109148Z op = torch.compile(op) 2025-05-07T20:32:35.8109254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8109320Z 2025-05-07T20:32:35.8109409Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8109416Z 2025-05-07T20:32:35.8109515Z moe/activation_test.py:117: 2025-05-07T20:32:35.8109643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8109740Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8109839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8110335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8110430Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8110793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8111009Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8111346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8111440Z kernel = self.compile( 2025-05-07T20:32:35.8111826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8112004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8112128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8112133Z 2025-05-07T20:32:35.8112341Z self = 2025-05-07T20:32:35.8113166Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8113670Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec0f64940>} 2025-05-07T20:32:35.8114448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8114672Z context = 2025-05-07T20:32:35.8114677Z 2025-05-07T20:32:35.8114846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8115112Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8115220Z module_map=module_map) 2025-05-07T20:32:35.8115380Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8115474Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8115553Z E ^ 2025-05-07T20:32:35.8115901Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8115906Z 2025-05-07T20:32:35.8116320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8116325Z 2025-05-07T20:32:35.8116432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8116647Z self=, 2025-05-07T20:32:35.8116727Z T=128, 2025-05-07T20:32:35.8116798Z D=7168, 2025-05-07T20:32:35.8116878Z scale_ub=None, 2025-05-07T20:32:35.8117005Z contiguous=True, 2025-05-07T20:32:35.8117086Z compiled=False, 2025-05-07T20:32:35.8117157Z ) 2025-05-07T20:32:35.8117369Z self = 2025-05-07T20:32:35.8117534Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8117539Z 2025-05-07T20:32:35.8117609Z @given( 2025-05-07T20:32:35.8117727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8117821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8117942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8118056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8118170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8118246Z ) 2025-05-07T20:32:35.8118493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8118593Z def test_silu_mul_quant( 2025-05-07T20:32:35.8118668Z self, 2025-05-07T20:32:35.8118745Z T: int, 2025-05-07T20:32:35.8118818Z D: int, 2025-05-07T20:32:35.8118919Z scale_ub: Optional[float], 2025-05-07T20:32:35.8119006Z contiguous: bool, 2025-05-07T20:32:35.8119088Z compiled: bool, 2025-05-07T20:32:35.8119166Z ) -> None: 2025-05-07T20:32:35.8119257Z torch.manual_seed(2025) 2025-05-07T20:32:35.8119327Z 2025-05-07T20:32:35.8119491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8119560Z 2025-05-07T20:32:35.8119656Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8119780Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8119870Z x = x_sign * x_clamp 2025-05-07T20:32:35.8119948Z x0 = x[:, :D] 2025-05-07T20:32:35.8120024Z x1 = x[:, D:] 2025-05-07T20:32:35.8120094Z 2025-05-07T20:32:35.8120181Z if contiguous: 2025-05-07T20:32:35.8120270Z x0 = x0.contiguous() 2025-05-07T20:32:35.8120404Z x1 = x1.contiguous() 2025-05-07T20:32:35.8120477Z 2025-05-07T20:32:35.8120566Z if scale_ub is not None: 2025-05-07T20:32:35.8120666Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8120800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8120876Z ) 2025-05-07T20:32:35.8120958Z else: 2025-05-07T20:32:35.8121049Z scale_ub_tensor = None 2025-05-07T20:32:35.8121117Z 2025-05-07T20:32:35.8121246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8121376Z op = silu_mul_quant 2025-05-07T20:32:35.8121459Z if compiled: 2025-05-07T20:32:35.8121603Z op = torch.compile(op) 2025-05-07T20:32:35.8121708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8121779Z 2025-05-07T20:32:35.8121869Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8121874Z 2025-05-07T20:32:35.8121968Z moe/activation_test.py:117: 2025-05-07T20:32:35.8122100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8122201Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8122295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8122786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8122879Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8123231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8123456Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8123796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8123888Z kernel = self.compile( 2025-05-07T20:32:35.8124309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8124485Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8124610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8124616Z 2025-05-07T20:32:35.8124820Z self = 2025-05-07T20:32:35.8125590Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8126093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec0f65240>} 2025-05-07T20:32:35.8126843Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8127035Z context = 2025-05-07T20:32:35.8127039Z 2025-05-07T20:32:35.8127202Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8127464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8127567Z module_map=module_map) 2025-05-07T20:32:35.8127725Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8127826Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8127903Z E ^ 2025-05-07T20:32:35.8128254Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8128263Z 2025-05-07T20:32:35.8128669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8128717Z 2025-05-07T20:32:35.8128820Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8129041Z self=, 2025-05-07T20:32:35.8129120Z T=2048, 2025-05-07T20:32:35.8129190Z D=7168, 2025-05-07T20:32:35.8129270Z scale_ub=1200.0, 2025-05-07T20:32:35.8129350Z contiguous=True, 2025-05-07T20:32:35.8129430Z compiled=False, 2025-05-07T20:32:35.8129502Z ) 2025-05-07T20:32:35.8129713Z self = 2025-05-07T20:32:35.8129933Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8129938Z 2025-05-07T20:32:35.8130047Z @given( 2025-05-07T20:32:35.8130164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8130263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8130376Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8130498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8130612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8130683Z ) 2025-05-07T20:32:35.8130923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8131016Z def test_silu_mul_quant( 2025-05-07T20:32:35.8131090Z self, 2025-05-07T20:32:35.8131166Z T: int, 2025-05-07T20:32:35.8131239Z D: int, 2025-05-07T20:32:35.8131336Z scale_ub: Optional[float], 2025-05-07T20:32:35.8131424Z contiguous: bool, 2025-05-07T20:32:35.8131509Z compiled: bool, 2025-05-07T20:32:35.8131585Z ) -> None: 2025-05-07T20:32:35.8131685Z torch.manual_seed(2025) 2025-05-07T20:32:35.8131753Z 2025-05-07T20:32:35.8131920Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8133724Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
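[annotation] The two failure modes interleave: even the T=1 example above dies in the Triton compiler, confirming the fp8e4nv error is architecture-dependent rather than size-dependent, while the OOMs depend only on how much memory earlier examples left behind. One plausible reason the memory never comes back is that each failed example's exception frames keep its tensors reachable. Because hypothesis runs every example inside a single test call, a pytest fixture would fire too late; a cleanup helper invoked in a try/finally around each example body (hypothetical, not present in the test file) would at least return cached blocks:

# Hypothetical per-example cleanup for use inside the test body.
import gc
import torch

def _release_cuda_memory() -> None:
    # gc.collect() drops tensors kept alive only by exception frames;
    # empty_cache() returns the allocator's cached blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()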
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8133734Z 2025-05-07T20:32:35.8133850Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8133857Z 2025-05-07T20:32:35.8133960Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8134179Z self=, 2025-05-07T20:32:35.8134256Z T=1, 2025-05-07T20:32:35.8134330Z D=5120, 2025-05-07T20:32:35.8134415Z scale_ub=1200.0, 2025-05-07T20:32:35.8134497Z contiguous=True, 2025-05-07T20:32:35.8134578Z compiled=False, 2025-05-07T20:32:35.8134648Z ) 2025-05-07T20:32:35.8134864Z self = 2025-05-07T20:32:35.8135026Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8135031Z 2025-05-07T20:32:35.8135102Z @given( 2025-05-07T20:32:35.8135221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8135315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8135429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8135541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8135656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8135734Z ) 2025-05-07T20:32:35.8135981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8136072Z def test_silu_mul_quant( 2025-05-07T20:32:35.8136150Z self, 2025-05-07T20:32:35.8136225Z T: int, 2025-05-07T20:32:35.8136300Z D: int, 2025-05-07T20:32:35.8136445Z scale_ub: Optional[float], 2025-05-07T20:32:35.8136532Z contiguous: bool, 2025-05-07T20:32:35.8136615Z compiled: bool, 2025-05-07T20:32:35.8136692Z ) -> None: 2025-05-07T20:32:35.8136782Z torch.manual_seed(2025) 2025-05-07T20:32:35.8136854Z 2025-05-07T20:32:35.8137023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8137095Z 2025-05-07T20:32:35.8137187Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8137307Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8137440Z x = x_sign * x_clamp 2025-05-07T20:32:35.8137517Z x0 = x[:, :D] 2025-05-07T20:32:35.8137655Z x1 = x[:, D:] 2025-05-07T20:32:35.8137731Z 2025-05-07T20:32:35.8137813Z if contiguous: 2025-05-07T20:32:35.8137904Z x0 = x0.contiguous() 2025-05-07T20:32:35.8137992Z x1 = x1.contiguous() 2025-05-07T20:32:35.8138065Z 2025-05-07T20:32:35.8138158Z if scale_ub is not None: 2025-05-07T20:32:35.8138270Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8138402Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8138475Z ) 2025-05-07T20:32:35.8138555Z else: 2025-05-07T20:32:35.8138647Z scale_ub_tensor = None 2025-05-07T20:32:35.8138714Z 2025-05-07T20:32:35.8138852Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8138939Z op = silu_mul_quant 2025-05-07T20:32:35.8139026Z if compiled: 2025-05-07T20:32:35.8139127Z op = torch.compile(op) 2025-05-07T20:32:35.8139231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8139307Z 2025-05-07T20:32:35.8139396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8139400Z 2025-05-07T20:32:35.8139495Z moe/activation_test.py:117: 2025-05-07T20:32:35.8139624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8139844Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8139967Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8140479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8140580Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8140941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8141159Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8141503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8141601Z kernel = self.compile( 2025-05-07T20:32:35.8141979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8142152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8142279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8142284Z 2025-05-07T20:32:35.8142485Z self = 2025-05-07T20:32:35.8143259Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8143762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec0f66200>} 2025-05-07T20:32:35.8144505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8144769Z context = 2025-05-07T20:32:35.8144774Z 2025-05-07T20:32:35.8144954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8145265Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8145377Z module_map=module_map) 2025-05-07T20:32:35.8145543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8145639Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8145714Z E ^ 2025-05-07T20:32:35.8146106Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8146147Z 2025-05-07T20:32:35.8146562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8146567Z 2025-05-07T20:32:35.8146670Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8146892Z self=, 2025-05-07T20:32:35.8146968Z T=2048, 2025-05-07T20:32:35.8147045Z D=5120, 2025-05-07T20:32:35.8147124Z scale_ub=None, 2025-05-07T20:32:35.8147205Z contiguous=True, 2025-05-07T20:32:35.8147287Z compiled=False, 2025-05-07T20:32:35.8147358Z ) 2025-05-07T20:32:35.8147569Z self = 2025-05-07T20:32:35.8147743Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8147751Z 2025-05-07T20:32:35.8147822Z @given( 2025-05-07T20:32:35.8147939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8148035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8148147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8148267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8148377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8148449Z ) 2025-05-07T20:32:35.8148745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8148836Z def test_silu_mul_quant( 2025-05-07T20:32:35.8148909Z self, 2025-05-07T20:32:35.8148984Z T: int, 2025-05-07T20:32:35.8149059Z D: int, 2025-05-07T20:32:35.8149159Z scale_ub: Optional[float], 2025-05-07T20:32:35.8149244Z contiguous: bool, 2025-05-07T20:32:35.8149329Z compiled: bool, 2025-05-07T20:32:35.8149406Z ) -> None: 2025-05-07T20:32:35.8149500Z torch.manual_seed(2025) 2025-05-07T20:32:35.8149571Z 2025-05-07T20:32:35.8149745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8149817Z 2025-05-07T20:32:35.8149908Z > x_sign = torch.sign(x) 2025-05-07T20:32:35.8151697Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
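[annotation] The "Tried to allocate" figures match the test's bf16 input tensor exactly: `torch.randn([T, 2 * D], dtype=torch.bfloat16)` needs T * 2D elements at 2 bytes each. A quick check against the sizes logged in this run:

# Each failed allocation is exactly one [T, 2*D] bfloat16 tensor (2 bytes/elem).
def alloc_mib(T: int, D: int) -> float:
    return T * 2 * D * 2 / 2**20

assert alloc_mib(16384, 5120) == 320.0  # the 320.00 MiB failures
assert alloc_mib(16384, 7168) == 448.0  # the 448.00 MiB failure
assert alloc_mib(4096, 7168) == 112.0   # the 112.00 MiB failures
assert alloc_mib(2048, 5120) == 40.0    # the 40.00 MiB failures

So the requests themselves are modest; the failures come from the ~21.7 GiB already held by the process, not from any single oversized example.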
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8151705Z 2025-05-07T20:32:35.8151819Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:35.8151824Z 2025-05-07T20:32:35.8151932Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8152151Z self=, 2025-05-07T20:32:35.8152230Z T=16384, 2025-05-07T20:32:35.8152304Z D=5120, 2025-05-07T20:32:35.8152382Z scale_ub=None, 2025-05-07T20:32:35.8152467Z contiguous=True, 2025-05-07T20:32:35.8152547Z compiled=False, 2025-05-07T20:32:35.8152618Z ) 2025-05-07T20:32:35.8152878Z self = 2025-05-07T20:32:35.8153049Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8153053Z 2025-05-07T20:32:35.8153123Z @given( 2025-05-07T20:32:35.8153244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8153344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8153460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8153577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8153731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8153806Z ) 2025-05-07T20:32:35.8154089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8154179Z def test_silu_mul_quant( 2025-05-07T20:32:35.8154255Z self, 2025-05-07T20:32:35.8154329Z T: int, 2025-05-07T20:32:35.8154401Z D: int, 2025-05-07T20:32:35.8154508Z scale_ub: Optional[float], 2025-05-07T20:32:35.8154600Z contiguous: bool, 2025-05-07T20:32:35.8154683Z compiled: bool, 2025-05-07T20:32:35.8154766Z ) -> None: 2025-05-07T20:32:35.8154862Z torch.manual_seed(2025) 2025-05-07T20:32:35.8154931Z 2025-05-07T20:32:35.8155099Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8156894Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8156908Z 2025-05-07T20:32:35.8157066Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8157072Z 2025-05-07T20:32:35.8157173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8157391Z self=, 2025-05-07T20:32:35.8157462Z T=4096, 2025-05-07T20:32:35.8157535Z D=5120, 2025-05-07T20:32:35.8157625Z scale_ub=None, 2025-05-07T20:32:35.8157706Z contiguous=True, 2025-05-07T20:32:35.8157784Z compiled=False, 2025-05-07T20:32:35.8157854Z ) 2025-05-07T20:32:35.8158064Z self = 2025-05-07T20:32:35.8158238Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8158249Z 2025-05-07T20:32:35.8158321Z @given( 2025-05-07T20:32:35.8158432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8158529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8158640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8158758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8158870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8158940Z ) 2025-05-07T20:32:35.8159180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8159274Z def test_silu_mul_quant( 2025-05-07T20:32:35.8159347Z self, 2025-05-07T20:32:35.8159418Z T: int, 2025-05-07T20:32:35.8159491Z D: int, 2025-05-07T20:32:35.8159586Z scale_ub: Optional[float], 2025-05-07T20:32:35.8159678Z contiguous: bool, 2025-05-07T20:32:35.8159760Z compiled: bool, 2025-05-07T20:32:35.8159833Z ) -> None: 2025-05-07T20:32:35.8159929Z torch.manual_seed(2025) 2025-05-07T20:32:35.8159999Z 2025-05-07T20:32:35.8160165Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8161923Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8161973Z 2025-05-07T20:32:35.8162090Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8162133Z 2025-05-07T20:32:35.8162237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8162490Z self=, 2025-05-07T20:32:35.8162561Z T=2048, 2025-05-07T20:32:35.8162637Z D=5120, 2025-05-07T20:32:35.8162717Z scale_ub=None, 2025-05-07T20:32:35.8162801Z contiguous=False, 2025-05-07T20:32:35.8162883Z compiled=False, 2025-05-07T20:32:35.8162954Z ) 2025-05-07T20:32:35.8163166Z self = 2025-05-07T20:32:35.8163335Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.8163339Z 2025-05-07T20:32:35.8163408Z @given( 2025-05-07T20:32:35.8163526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8163618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8163732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8163851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8163961Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8164037Z ) 2025-05-07T20:32:35.8164275Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8164367Z def test_silu_mul_quant( 2025-05-07T20:32:35.8164441Z self, 2025-05-07T20:32:35.8164511Z T: int, 2025-05-07T20:32:35.8164586Z D: int, 2025-05-07T20:32:35.8164725Z scale_ub: Optional[float], 2025-05-07T20:32:35.8164815Z contiguous: bool, 2025-05-07T20:32:35.8164899Z compiled: bool, 2025-05-07T20:32:35.8164973Z ) -> None: 2025-05-07T20:32:35.8165066Z torch.manual_seed(2025) 2025-05-07T20:32:35.8165136Z 2025-05-07T20:32:35.8165300Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8167088Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8167103Z 2025-05-07T20:32:35.8167218Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8167222Z 2025-05-07T20:32:35.8167321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8167540Z self=, 2025-05-07T20:32:35.8167609Z T=4096, 2025-05-07T20:32:35.8167679Z D=7168, 2025-05-07T20:32:35.8167760Z scale_ub=None, 2025-05-07T20:32:35.8167840Z contiguous=True, 2025-05-07T20:32:35.8167917Z compiled=True, 2025-05-07T20:32:35.8167993Z ) 2025-05-07T20:32:35.8168206Z self = 2025-05-07T20:32:35.8168374Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.8168382Z 2025-05-07T20:32:35.8168457Z @given( 2025-05-07T20:32:35.8168569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8168666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8168826Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8168942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8169057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8169126Z ) 2025-05-07T20:32:35.8169368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8169462Z def test_silu_mul_quant( 2025-05-07T20:32:35.8169537Z self, 2025-05-07T20:32:35.8169608Z T: int, 2025-05-07T20:32:35.8169683Z D: int, 2025-05-07T20:32:35.8169845Z scale_ub: Optional[float], 2025-05-07T20:32:35.8169932Z contiguous: bool, 2025-05-07T20:32:35.8170053Z compiled: bool, 2025-05-07T20:32:35.8170129Z ) -> None: 2025-05-07T20:32:35.8170224Z torch.manual_seed(2025) 2025-05-07T20:32:35.8170292Z 2025-05-07T20:32:35.8170454Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8172259Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8172268Z 2025-05-07T20:32:35.8172381Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8172386Z 2025-05-07T20:32:35.8172489Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8172704Z self=, 2025-05-07T20:32:35.8172775Z T=2048, 2025-05-07T20:32:35.8172847Z D=5120, 2025-05-07T20:32:35.8172925Z scale_ub=1200.0, 2025-05-07T20:32:35.8173050Z contiguous=False, 2025-05-07T20:32:35.8173130Z compiled=False, 2025-05-07T20:32:35.8173199Z ) 2025-05-07T20:32:35.8173412Z self = 2025-05-07T20:32:35.8173584Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.8173588Z 2025-05-07T20:32:35.8173658Z @given( 2025-05-07T20:32:35.8173773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8173865Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8173981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8174097Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8174212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8174285Z ) 2025-05-07T20:32:35.8174529Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8174618Z def test_silu_mul_quant( 2025-05-07T20:32:35.8174696Z self, 2025-05-07T20:32:35.8174772Z T: int, 2025-05-07T20:32:35.8174844Z D: int, 2025-05-07T20:32:35.8174942Z scale_ub: Optional[float], 2025-05-07T20:32:35.8175028Z contiguous: bool, 2025-05-07T20:32:35.8175110Z compiled: bool, 2025-05-07T20:32:35.8175187Z ) -> None: 2025-05-07T20:32:35.8175279Z torch.manual_seed(2025) 2025-05-07T20:32:35.8175346Z 2025-05-07T20:32:35.8175512Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8177300Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8177357Z 2025-05-07T20:32:35.8177475Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8177480Z 2025-05-07T20:32:35.8177578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8177798Z self=, 2025-05-07T20:32:35.8177868Z T=4096, 2025-05-07T20:32:35.8177941Z D=7168, 2025-05-07T20:32:35.8178022Z scale_ub=1200.0, 2025-05-07T20:32:35.8178149Z contiguous=True, 2025-05-07T20:32:35.8178227Z compiled=False, 2025-05-07T20:32:35.8178298Z ) 2025-05-07T20:32:35.8178544Z self = 2025-05-07T20:32:35.8178714Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8178722Z 2025-05-07T20:32:35.8178798Z @given( 2025-05-07T20:32:35.8178912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8179019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8179131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8179244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8179359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8179429Z ) 2025-05-07T20:32:35.8179670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8179815Z def test_silu_mul_quant( 2025-05-07T20:32:35.8179906Z self, 2025-05-07T20:32:35.8179993Z T: int, 2025-05-07T20:32:35.8180069Z D: int, 2025-05-07T20:32:35.8180168Z scale_ub: Optional[float], 2025-05-07T20:32:35.8180265Z contiguous: bool, 2025-05-07T20:32:35.8180347Z compiled: bool, 2025-05-07T20:32:35.8180420Z ) -> None: 2025-05-07T20:32:35.8180516Z torch.manual_seed(2025) 2025-05-07T20:32:35.8180588Z 2025-05-07T20:32:35.8180803Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8182605Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8182614Z 2025-05-07T20:32:35.8182734Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8182739Z 2025-05-07T20:32:35.8182842Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8183060Z self=, 2025-05-07T20:32:35.8183138Z T=16384, 2025-05-07T20:32:35.8183221Z D=7168, 2025-05-07T20:32:35.8183306Z scale_ub=None, 2025-05-07T20:32:35.8183394Z contiguous=False, 2025-05-07T20:32:35.8183472Z compiled=True, 2025-05-07T20:32:35.8183542Z ) 2025-05-07T20:32:35.8183755Z self = 2025-05-07T20:32:35.8183928Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.8183933Z 2025-05-07T20:32:35.8184003Z @given( 2025-05-07T20:32:35.8184117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8184214Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8184328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8184455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8184568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8184642Z ) 2025-05-07T20:32:35.8184883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8185025Z def test_silu_mul_quant( 2025-05-07T20:32:35.8185108Z self, 2025-05-07T20:32:35.8185181Z T: int, 2025-05-07T20:32:35.8185253Z D: int, 2025-05-07T20:32:35.8185352Z scale_ub: Optional[float], 2025-05-07T20:32:35.8185438Z contiguous: bool, 2025-05-07T20:32:35.8185520Z compiled: bool, 2025-05-07T20:32:35.8185597Z ) -> None: 2025-05-07T20:32:35.8185688Z torch.manual_seed(2025) 2025-05-07T20:32:35.8185757Z 2025-05-07T20:32:35.8185924Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8187804Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
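[editor note] The requested sizes in these failures follow directly from the example parameters: the test materializes a [T, 2 * D] bfloat16 tensor, i.e. T * 2D * 2 bytes. Checking the 448 MiB request just above (plain arithmetic, no GPU required):

T, D = 16384, 7168
bytes_needed = T * (2 * D) * 2   # bfloat16 = 2 bytes per element
print(bytes_needed / 2**20)      # 448.0 -> matches "Tried to allocate 448.00 MiB"

So no single tensor here is large; the OOMs come from the ~21.7 GiB the process is already holding when each new example starts.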
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8187821Z 2025-05-07T20:32:35.8187940Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8187945Z 2025-05-07T20:32:35.8188044Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8188261Z self=, 2025-05-07T20:32:35.8188332Z T=4096, 2025-05-07T20:32:35.8188401Z D=7168, 2025-05-07T20:32:35.8188488Z scale_ub=None, 2025-05-07T20:32:35.8188568Z contiguous=True, 2025-05-07T20:32:35.8188648Z compiled=False, 2025-05-07T20:32:35.8188724Z ) 2025-05-07T20:32:35.8188933Z self = 2025-05-07T20:32:35.8189101Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8189108Z 2025-05-07T20:32:35.8189184Z @given( 2025-05-07T20:32:35.8189336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8189436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8189550Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8189662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8189773Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8190042Z ) 2025-05-07T20:32:35.8190312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8190413Z def test_silu_mul_quant( 2025-05-07T20:32:35.8190488Z self, 2025-05-07T20:32:35.8190567Z T: int, 2025-05-07T20:32:35.8190642Z D: int, 2025-05-07T20:32:35.8190741Z scale_ub: Optional[float], 2025-05-07T20:32:35.8190833Z contiguous: bool, 2025-05-07T20:32:35.8190917Z compiled: bool, 2025-05-07T20:32:35.8190992Z ) -> None: 2025-05-07T20:32:35.8191087Z torch.manual_seed(2025) 2025-05-07T20:32:35.8191163Z 2025-05-07T20:32:35.8191332Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8193105Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8193113Z 2025-05-07T20:32:35.8193229Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8193234Z 2025-05-07T20:32:35.8193338Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8193555Z self=, 2025-05-07T20:32:35.8193723Z T=16384, 2025-05-07T20:32:35.8193798Z D=7168, 2025-05-07T20:32:35.8193879Z scale_ub=None, 2025-05-07T20:32:35.8193966Z contiguous=True, 2025-05-07T20:32:35.8194049Z compiled=False, 2025-05-07T20:32:35.8194120Z ) 2025-05-07T20:32:35.8194338Z self = 2025-05-07T20:32:35.8194511Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.8194515Z 2025-05-07T20:32:35.8194591Z @given( 2025-05-07T20:32:35.8194774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8194871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8195038Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8195158Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8195272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8195348Z ) 2025-05-07T20:32:35.8195602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8195694Z def test_silu_mul_quant( 2025-05-07T20:32:35.8195775Z self, 2025-05-07T20:32:35.8195852Z T: int, 2025-05-07T20:32:35.8195929Z D: int, 2025-05-07T20:32:35.8196031Z scale_ub: Optional[float], 2025-05-07T20:32:35.8196118Z contiguous: bool, 2025-05-07T20:32:35.8196203Z compiled: bool, 2025-05-07T20:32:35.8196286Z ) -> None: 2025-05-07T20:32:35.8196376Z torch.manual_seed(2025) 2025-05-07T20:32:35.8196451Z 2025-05-07T20:32:35.8196619Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8198442Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8198455Z 2025-05-07T20:32:35.8198574Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8198578Z 2025-05-07T20:32:35.8198682Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8198906Z self=, 2025-05-07T20:32:35.8198987Z T=16384, 2025-05-07T20:32:35.8199062Z D=7168, 2025-05-07T20:32:35.8199146Z scale_ub=1200.0, 2025-05-07T20:32:35.8199232Z contiguous=True, 2025-05-07T20:32:35.8199315Z compiled=False, 2025-05-07T20:32:35.8199389Z ) 2025-05-07T20:32:35.8199604Z self = 2025-05-07T20:32:35.8202624Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8202640Z 2025-05-07T20:32:35.8202725Z @given( 2025-05-07T20:32:35.8202848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8202954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8203071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8203193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8203308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8203385Z ) 2025-05-07T20:32:35.8203640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8203739Z def test_silu_mul_quant( 2025-05-07T20:32:35.8203819Z self, 2025-05-07T20:32:35.8203901Z T: int, 2025-05-07T20:32:35.8203978Z D: int, 2025-05-07T20:32:35.8204079Z scale_ub: Optional[float], 2025-05-07T20:32:35.8204175Z contiguous: bool, 2025-05-07T20:32:35.8204261Z compiled: bool, 2025-05-07T20:32:35.8204433Z ) -> None: 2025-05-07T20:32:35.8204534Z torch.manual_seed(2025) 2025-05-07T20:32:35.8204608Z 2025-05-07T20:32:35.8204786Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8206607Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
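[editor note] Every one of these OOMs reports roughly the same baseline (~21.7 GiB already allocated by PyTorch), which points at memory accumulating across Hypothesis examples rather than at any one oversized allocation. One possible mitigation, sketched below: Hypothesis invokes the decorated test body once per example (setUp/tearDown only wrap the whole run), so a reset at the top of the body runs between examples. _reset_cuda_memory is a hypothetical helper, not something the test currently defines:

import gc

import torch

def _reset_cuda_memory() -> None:
    # Drop dead Python references, then return cached blocks to the driver so
    # the next Hypothesis example starts from a clean allocator state.
    gc.collect()
    torch.cuda.empty_cache()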
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8206650Z 2025-05-07T20:32:35.8206776Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8206781Z 2025-05-07T20:32:35.8206885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8207114Z self=, 2025-05-07T20:32:35.8207195Z T=128, 2025-05-07T20:32:35.8207272Z D=5120, 2025-05-07T20:32:35.8207357Z scale_ub=1200.0, 2025-05-07T20:32:35.8207449Z contiguous=False, 2025-05-07T20:32:35.8207532Z compiled=False, 2025-05-07T20:32:35.8207607Z ) 2025-05-07T20:32:35.8207827Z self = 2025-05-07T20:32:35.8208001Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.8208008Z 2025-05-07T20:32:35.8208087Z @given( 2025-05-07T20:32:35.8208208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8208308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8208433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8208556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8208672Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8208793Z ) 2025-05-07T20:32:35.8209046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8209145Z def test_silu_mul_quant( 2025-05-07T20:32:35.8209223Z self, 2025-05-07T20:32:35.8209303Z T: int, 2025-05-07T20:32:35.8209386Z D: int, 2025-05-07T20:32:35.8209487Z scale_ub: Optional[float], 2025-05-07T20:32:35.8209579Z contiguous: bool, 2025-05-07T20:32:35.8209668Z compiled: bool, 2025-05-07T20:32:35.8209747Z ) -> None: 2025-05-07T20:32:35.8209847Z torch.manual_seed(2025) 2025-05-07T20:32:35.8209923Z 2025-05-07T20:32:35.8210095Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8210169Z 2025-05-07T20:32:35.8210272Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8210400Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8210496Z x = x_sign * x_clamp 2025-05-07T20:32:35.8210580Z x0 = x[:, :D] 2025-05-07T20:32:35.8210663Z x1 = x[:, D:] 2025-05-07T20:32:35.8210742Z 2025-05-07T20:32:35.8210830Z if contiguous: 2025-05-07T20:32:35.8210925Z x0 = x0.contiguous() 2025-05-07T20:32:35.8211020Z x1 = x1.contiguous() 2025-05-07T20:32:35.8211092Z 2025-05-07T20:32:35.8211185Z if scale_ub is not None: 2025-05-07T20:32:35.8211300Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8211438Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8211516Z ) 2025-05-07T20:32:35.8211605Z else: 2025-05-07T20:32:35.8211701Z scale_ub_tensor = None 2025-05-07T20:32:35.8211776Z 2025-05-07T20:32:35.8211914Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8212003Z op = silu_mul_quant 2025-05-07T20:32:35.8212093Z if compiled: 2025-05-07T20:32:35.8212197Z op = torch.compile(op) 2025-05-07T20:32:35.8212352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8212427Z 2025-05-07T20:32:35.8212523Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8212529Z 2025-05-07T20:32:35.8212627Z moe/activation_test.py:117: 2025-05-07T20:32:35.8212762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8212869Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8212972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8213480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8213621Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8214023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8214254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8214600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8214703Z kernel = self.compile( 2025-05-07T20:32:35.8215094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8215276Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8215405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8215410Z 2025-05-07T20:32:35.8215617Z self = 2025-05-07T20:32:35.8216409Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8216963Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec1229ea0>} 2025-05-07T20:32:35.8217731Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8217925Z context = 2025-05-07T20:32:35.8217930Z 2025-05-07T20:32:35.8218100Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8218374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8218486Z module_map=module_map) 2025-05-07T20:32:35.8218655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8218755Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8218833Z E ^ 2025-05-07T20:32:35.8219197Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8219204Z 2025-05-07T20:32:35.8219617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8219622Z 2025-05-07T20:32:35.8219732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8220064Z self=, 2025-05-07T20:32:35.8220145Z T=2048, 2025-05-07T20:32:35.8220229Z D=7168, 2025-05-07T20:32:35.8220312Z scale_ub=None, 2025-05-07T20:32:35.8220404Z contiguous=False, 2025-05-07T20:32:35.8220494Z compiled=False, 2025-05-07T20:32:35.8220566Z ) 2025-05-07T20:32:35.8220788Z self = 2025-05-07T20:32:35.8220969Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.8220974Z 2025-05-07T20:32:35.8221051Z @given( 2025-05-07T20:32:35.8221226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8221331Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8221452Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8221572Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8221690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8221764Z ) 2025-05-07T20:32:35.8222021Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8222117Z def test_silu_mul_quant( 2025-05-07T20:32:35.8222237Z self, 2025-05-07T20:32:35.8222319Z T: int, 2025-05-07T20:32:35.8222401Z D: int, 2025-05-07T20:32:35.8222549Z scale_ub: Optional[float], 2025-05-07T20:32:35.8222644Z contiguous: bool, 2025-05-07T20:32:35.8222731Z compiled: bool, 2025-05-07T20:32:35.8222814Z ) -> None: 2025-05-07T20:32:35.8222910Z torch.manual_seed(2025) 2025-05-07T20:32:35.8222990Z 2025-05-07T20:32:35.8223171Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8224971Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
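[editor note] Alongside the OOMs, the run hits a second, independent failure mode: the Triton CompilationError above. fp8e4nv corresponds to torch.float8_e4m3fn, which Triton at this version only supports on NVIDIA parts with compute capability 8.9 or newer (Ada/Hopper); the A10G on this g5 runner reports (8, 6), so only fp8e5 and fp8e4b15 are available, exactly as the ValueError says. A sketch of one possible capability guard (supports_fp8e4nv is a hypothetical helper, not the repo's existing gating):

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv lowering requires compute capability >= (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the test, this would skip cleanly instead of erroring:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")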
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8224979Z 2025-05-07T20:32:35.8225105Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8225110Z 2025-05-07T20:32:35.8225214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8225440Z self=, 2025-05-07T20:32:35.8225527Z T=128, 2025-05-07T20:32:35.8225641Z D=7168, 2025-05-07T20:32:35.8225726Z scale_ub=1200.0, 2025-05-07T20:32:35.8225815Z contiguous=True, 2025-05-07T20:32:35.8225907Z compiled=True, 2025-05-07T20:32:35.8225980Z ) 2025-05-07T20:32:35.8226200Z self = 2025-05-07T20:32:35.8226373Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.8226378Z 2025-05-07T20:32:35.8226454Z @given( 2025-05-07T20:32:35.8226575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8226682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8226802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8226922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8227037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8227111Z ) 2025-05-07T20:32:35.8227365Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8227462Z def test_silu_mul_quant( 2025-05-07T20:32:35.8227539Z self, 2025-05-07T20:32:35.8227619Z T: int, 2025-05-07T20:32:35.8227695Z D: int, 2025-05-07T20:32:35.8227799Z scale_ub: Optional[float], 2025-05-07T20:32:35.8227890Z contiguous: bool, 2025-05-07T20:32:35.8227980Z compiled: bool, 2025-05-07T20:32:35.8228060Z ) -> None: 2025-05-07T20:32:35.8228156Z torch.manual_seed(2025) 2025-05-07T20:32:35.8228228Z 2025-05-07T20:32:35.8228403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8228480Z 2025-05-07T20:32:35.8228576Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8228705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8228795Z x = x_sign * x_clamp 2025-05-07T20:32:35.8228875Z x0 = x[:, :D] 2025-05-07T20:32:35.8228959Z x1 = x[:, D:] 2025-05-07T20:32:35.8229081Z 2025-05-07T20:32:35.8229167Z if contiguous: 2025-05-07T20:32:35.8229263Z x0 = x0.contiguous() 2025-05-07T20:32:35.8229352Z x1 = x1.contiguous() 2025-05-07T20:32:35.8229429Z 2025-05-07T20:32:35.8229521Z if scale_ub is not None: 2025-05-07T20:32:35.8229626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.8229766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.8229842Z ) 2025-05-07T20:32:35.8229921Z else: 2025-05-07T20:32:35.8230017Z scale_ub_tensor = None 2025-05-07T20:32:35.8230134Z 2025-05-07T20:32:35.8230268Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.8230398Z op = silu_mul_quant 2025-05-07T20:32:35.8230486Z if compiled: 2025-05-07T20:32:35.8230587Z op = torch.compile(op) 2025-05-07T20:32:35.8230698Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8230772Z 2025-05-07T20:32:35.8230870Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.8230877Z 2025-05-07T20:32:35.8230976Z moe/activation_test.py:117: 2025-05-07T20:32:35.8231105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8231213Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.8231316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.8231699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.8231796Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.8232310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.8232418Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.8232787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.8233017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.8233435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.8233533Z kernel = self.compile( 2025-05-07T20:32:35.8233921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.8234100Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.8234230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.8234237Z 2025-05-07T20:32:35.8234458Z self = 2025-05-07T20:32:35.8235260Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.8235789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcec122b7f0>} 2025-05-07T20:32:35.8236549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.8236745Z context = 2025-05-07T20:32:35.8236749Z 2025-05-07T20:32:35.8236920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.8237193Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.8237305Z module_map=module_map) 2025-05-07T20:32:35.8237468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.8237567Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.8237690Z E ^ 2025-05-07T20:32:35.8238054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.8238059Z 2025-05-07T20:32:35.8238484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.8238492Z 2025-05-07T20:32:35.8238600Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8238827Z self=, 2025-05-07T20:32:35.8238908Z T=128, 2025-05-07T20:32:35.8239027Z D=7168, 2025-05-07T20:32:35.8239113Z scale_ub=1200.0, 2025-05-07T20:32:35.8239203Z contiguous=True, 2025-05-07T20:32:35.8239324Z compiled=False, 2025-05-07T20:32:35.8239399Z ) 2025-05-07T20:32:35.8239623Z self = 2025-05-07T20:32:35.8239801Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.8239808Z 2025-05-07T20:32:35.8239888Z @given( 2025-05-07T20:32:35.8240013Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8240115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8240234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8240353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8240469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8240545Z ) 2025-05-07T20:32:35.8240796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8240895Z def test_silu_mul_quant( 2025-05-07T20:32:35.8240976Z self, 2025-05-07T20:32:35.8241053Z T: int, 2025-05-07T20:32:35.8241133Z D: int, 2025-05-07T20:32:35.8241239Z scale_ub: Optional[float], 2025-05-07T20:32:35.8241330Z contiguous: bool, 2025-05-07T20:32:35.8241419Z compiled: bool, 2025-05-07T20:32:35.8241497Z ) -> None: 2025-05-07T20:32:35.8241596Z torch.manual_seed(2025) 2025-05-07T20:32:35.8241712Z 2025-05-07T20:32:35.8241886Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8241962Z 2025-05-07T20:32:35.8242061Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8242191Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8243999Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8244010Z 2025-05-07T20:32:35.8244133Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8244140Z 2025-05-07T20:32:35.8244246Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8244477Z self=, 2025-05-07T20:32:35.8244554Z T=128, 2025-05-07T20:32:35.8244635Z D=5120, 2025-05-07T20:32:35.8244719Z scale_ub=1200.0, 2025-05-07T20:32:35.8244807Z contiguous=True, 2025-05-07T20:32:35.8244896Z compiled=True, 2025-05-07T20:32:35.8244970Z ) 2025-05-07T20:32:35.8245188Z self = 2025-05-07T20:32:35.8245369Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.8245374Z 2025-05-07T20:32:35.8245451Z @given( 2025-05-07T20:32:35.8245570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8245672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8245788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8245958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8246077Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8246152Z ) 2025-05-07T20:32:35.8246403Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8246499Z def test_silu_mul_quant( 2025-05-07T20:32:35.8246577Z self, 2025-05-07T20:32:35.8246659Z T: int, 2025-05-07T20:32:35.8246735Z D: int, 2025-05-07T20:32:35.8246836Z scale_ub: Optional[float], 2025-05-07T20:32:35.8246974Z contiguous: bool, 2025-05-07T20:32:35.8247061Z compiled: bool, 2025-05-07T20:32:35.8247140Z ) -> None: 2025-05-07T20:32:35.8247279Z torch.manual_seed(2025) 2025-05-07T20:32:35.8247356Z 2025-05-07T20:32:35.8247527Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8247605Z 2025-05-07T20:32:35.8247696Z x_sign = torch.sign(x) 2025-05-07T20:32:35.8247831Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.8249625Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
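[editor note] Note how the failure point has drifted: earlier examples died allocating x at moe/activation_test.py:92 with ~26 MiB free, while these die one statement later at the torch.clamp on line 95 with ~4 MiB free, which again suggests memory leaking across examples. When triaging, a snapshot of allocator state between examples makes this visible; a minimal sketch using documented torch.cuda APIs:

import torch

free, total = torch.cuda.mem_get_info()  # free/total bytes on the current device
print(f"free={free / 2**20:.1f} MiB of total={total / 2**30:.2f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))  # allocated vs. reserved breakdown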
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8249635Z 2025-05-07T20:32:35.8249765Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.8249770Z 2025-05-07T20:32:35.8249872Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.8250099Z self=, 2025-05-07T20:32:35.8250177Z T=128, 2025-05-07T20:32:35.8250257Z D=7168, 2025-05-07T20:32:35.8250380Z scale_ub=None, 2025-05-07T20:32:35.8250472Z contiguous=True, 2025-05-07T20:32:35.8250557Z compiled=True, 2025-05-07T20:32:35.8250631Z ) 2025-05-07T20:32:35.8250851Z self = 2025-05-07T20:32:35.8251019Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.8251024Z 2025-05-07T20:32:35.8251103Z @given( 2025-05-07T20:32:35.8251222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.8251324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.8251447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.8251568Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.8251684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.8251761Z ) 2025-05-07T20:32:35.8252014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.8252119Z def test_silu_mul_quant( 2025-05-07T20:32:35.8252197Z self, 2025-05-07T20:32:35.8252280Z T: int, 2025-05-07T20:32:35.8252360Z D: int, 2025-05-07T20:32:35.8252462Z scale_ub: Optional[float], 2025-05-07T20:32:35.8252553Z contiguous: bool, 2025-05-07T20:32:35.8252642Z compiled: bool, 2025-05-07T20:32:35.8252721Z ) -> None: 2025-05-07T20:32:35.8252820Z torch.manual_seed(2025) 2025-05-07T20:32:35.8252895Z 2025-05-07T20:32:35.8253066Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.8254879Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.8254929Z 2025-05-07T20:32:35.8255053Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.8255190Z =============================== warnings summary =============================== 2025-05-07T20:32:35.8255510Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.8255822Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.8256210Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.8257104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:35.8257343Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:35.8257348Z 2025-05-07T20:32:35.8257565Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:35.8257735Z ================= 1 failed, 1 deselected, 3 warnings in 17.45s ================= 2025-05-07T20:32:37.3693197Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:37.4309890Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:37.4310136Z 2025-05-07T20:32:39.4328078Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:41.5837370Z ============================= test session starts ============================== 2025-05-07T20:32:41.5838029Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:41.5838559Z cachedir: .pytest_cache 2025-05-07T20:32:41.5839145Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:41.5839874Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:41.5840291Z plugins: hypothesis-6.131.14 2025-05-07T20:32:43.1734063Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:43.3491281Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:43.3491695Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:43.3491912Z 2025-05-07T20:32:45.8639614Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.8640433Z self=, 2025-05-07T20:32:45.8640858Z T=1, 2025-05-07T20:32:45.8641047Z D=5120, 2025-05-07T20:32:45.8641249Z scale_ub=None, 2025-05-07T20:32:45.8641474Z contiguous=True, 2025-05-07T20:32:45.8641699Z compiled=True, 2025-05-07T20:32:45.8641916Z ) 2025-05-07T20:32:45.8642250Z self = 2025-05-07T20:32:45.8642742Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.8643019Z 2025-05-07T20:32:45.8643097Z @given( 2025-05-07T20:32:45.8643334Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.8643662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.8643970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.8644314Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.8644651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.8645241Z ) 2025-05-07T20:32:45.8645606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.8646057Z def test_silu_mul_quant( 2025-05-07T20:32:45.8646297Z self, 2025-05-07T20:32:45.8646499Z T: int, 2025-05-07T20:32:45.8646700Z D: int, 2025-05-07T20:32:45.8646922Z scale_ub: Optional[float], 2025-05-07T20:32:45.8647201Z contiguous: bool, 2025-05-07T20:32:45.8647449Z compiled: bool, 2025-05-07T20:32:45.8647677Z ) -> None: 2025-05-07T20:32:45.8647997Z torch.manual_seed(2025) 2025-05-07T20:32:45.8648245Z 2025-05-07T20:32:45.8648609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.8648956Z 2025-05-07T20:32:45.8649158Z x_sign = torch.sign(x) 2025-05-07T20:32:45.8649460Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:45.8649772Z x = x_sign * x_clamp 2025-05-07T20:32:45.8650025Z x0 = x[:, :D] 2025-05-07T20:32:45.8650253Z x1 = x[:, D:] 2025-05-07T20:32:45.8650462Z 2025-05-07T20:32:45.8650655Z if contiguous: 2025-05-07T20:32:45.8650902Z x0 = x0.contiguous() 2025-05-07T20:32:45.8651160Z x1 = x1.contiguous() 2025-05-07T20:32:45.8651407Z 2025-05-07T20:32:45.8651606Z if scale_ub is not None: 2025-05-07T20:32:45.8651881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.8652225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.8652542Z ) 2025-05-07T20:32:45.8652738Z else: 2025-05-07T20:32:45.8652958Z scale_ub_tensor = None 2025-05-07T20:32:45.8653218Z 2025-05-07T20:32:45.8653452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.8653770Z op = silu_mul_quant 2025-05-07T20:32:45.8654031Z if compiled: 2025-05-07T20:32:45.8654286Z op = torch.compile(op) 2025-05-07T20:32:45.8654672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.8654951Z 2025-05-07T20:32:45.8655151Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.8655435Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.8655730Z 2025-05-07T20:32:45.8655974Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.8656306Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.8656603Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.8656927Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.8657292Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.8657607Z 2025-05-07T20:32:45.8657819Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.8658017Z 2025-05-07T20:32:45.8658126Z moe/activation_test.py:126: 2025-05-07T20:32:45.8658423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.8658766Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.8659104Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.8660078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.8661185Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.8661979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.8662918Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.8663626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.8664367Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.8665136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:45.8665965Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.8666702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.8667356Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.8667970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.8668491Z fn() 2025-05-07T20:32:45.8669056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.8669690Z self.fn.run( 
2025-05-07T20:32:45.8670185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.8670723Z kernel = self.compile( 2025-05-07T20:32:45.8671280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.8671952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.8672348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.8672588Z 2025-05-07T20:32:45.8672800Z self = 2025-05-07T20:32:45.8673910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.8675331Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32572a4af0>} 2025-05-07T20:32:45.8676768Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.8677819Z context = 2025-05-07T20:32:45.8678118Z 2025-05-07T20:32:45.8678288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.8678822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.8679313Z module_map=module_map) 2025-05-07T20:32:45.8679683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.8680051Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.8680321Z E ^ 2025-05-07T20:32:45.8680792Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.8681257Z 2025-05-07T20:32:45.8681682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.8682213Z 2025-05-07T20:32:45.8682319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.8682740Z self=, 2025-05-07T20:32:45.8683140Z T=2048, 2025-05-07T20:32:45.8683333Z D=5120, 2025-05-07T20:32:45.8683529Z scale_ub=1200.0, 2025-05-07T20:32:45.8683753Z contiguous=True, 2025-05-07T20:32:45.8683982Z compiled=False, 2025-05-07T20:32:45.8684193Z ) 2025-05-07T20:32:47.2038163Z self = 2025-05-07T20:32:47.2038788Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.2039067Z 2025-05-07T20:32:47.2039163Z @given( 2025-05-07T20:32:47.2039412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.2039733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.2040044Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.2040606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.2040933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.2041218Z ) 2025-05-07T20:32:47.2041577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.2042024Z def test_silu_mul_quant( 2025-05-07T20:32:47.2042271Z self, 2025-05-07T20:32:47.2042475Z T: int, 2025-05-07T20:32:47.2042671Z D: int, 2025-05-07T20:32:47.2042892Z scale_ub: Optional[float], 2025-05-07T20:32:47.2043258Z contiguous: bool, 2025-05-07T20:32:47.2043496Z compiled: bool, 2025-05-07T20:32:47.2043727Z ) -> None: 2025-05-07T20:32:47.2044018Z torch.manual_seed(2025) 2025-05-07T20:32:47.2044252Z 2025-05-07T20:32:47.2044530Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.2044872Z 
2025-05-07T20:32:47.2045068Z x_sign = torch.sign(x) 2025-05-07T20:32:47.2045362Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.2045674Z x = x_sign * x_clamp 2025-05-07T20:32:47.2045918Z x0 = x[:, :D] 2025-05-07T20:32:47.2046129Z x1 = x[:, D:] 2025-05-07T20:32:47.2046338Z 2025-05-07T20:32:47.2046526Z if contiguous: 2025-05-07T20:32:47.2046759Z x0 = x0.contiguous() 2025-05-07T20:32:47.2047017Z x1 = x1.contiguous() 2025-05-07T20:32:47.2047257Z 2025-05-07T20:32:47.2047444Z if scale_ub is not None: 2025-05-07T20:32:47.2047718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.2048058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.2048358Z ) 2025-05-07T20:32:47.2048555Z else: 2025-05-07T20:32:47.2048769Z scale_ub_tensor = None 2025-05-07T20:32:47.2049014Z 2025-05-07T20:32:47.2049244Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2049562Z op = silu_mul_quant 2025-05-07T20:32:47.2049816Z if compiled: 2025-05-07T20:32:47.2050136Z op = torch.compile(op) 2025-05-07T20:32:47.2050444Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2050719Z 2025-05-07T20:32:47.2050911Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.2051084Z 2025-05-07T20:32:47.2051187Z moe/activation_test.py:117: 2025-05-07T20:32:47.2051489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2051819Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.2052104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2052808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.2053500Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.2054034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.2054721Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.2055392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.2055918Z kernel = self.compile( 2025-05-07T20:32:47.2056465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.2057120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.2063592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2063846Z 2025-05-07T20:32:47.2064065Z self = 2025-05-07T20:32:47.2065145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.2066678Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3257181990>} 2025-05-07T20:32:47.2068020Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.2069042Z context = 2025-05-07T20:32:47.2069329Z 2025-05-07T20:32:47.2069555Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.2070119Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.2070589Z module_map=module_map) 2025-05-07T20:32:47.2070962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.2071321Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.2071586Z E ^ 2025-05-07T20:32:47.2072056Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.2072503Z 2025-05-07T20:32:47.2072927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.2073437Z 2025-05-07T20:32:47.2073554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.2073963Z self=, 2025-05-07T20:32:47.2074375Z T=2048, 2025-05-07T20:32:47.2074575Z D=5120, 2025-05-07T20:32:47.2074768Z scale_ub=1200.0, 2025-05-07T20:32:47.2075027Z contiguous=True, 2025-05-07T20:32:47.2075264Z compiled=True, 2025-05-07T20:32:47.2075480Z ) 2025-05-07T20:32:47.2075798Z self = 2025-05-07T20:32:47.2076294Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.2076618Z 2025-05-07T20:32:47.2076709Z @given( 2025-05-07T20:32:47.2076941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.2077260Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.2077572Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.2077907Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.2078230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.2078519Z ) 2025-05-07T20:32:47.2078881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.2079323Z def test_silu_mul_quant( 2025-05-07T20:32:47.2079579Z self, 2025-05-07T20:32:47.2079788Z T: int, 2025-05-07T20:32:47.2079992Z D: int, 2025-05-07T20:32:47.2080222Z scale_ub: Optional[float], 2025-05-07T20:32:47.2080504Z contiguous: bool, 2025-05-07T20:32:47.2080740Z compiled: bool, 2025-05-07T20:32:47.2080981Z ) -> None: 2025-05-07T20:32:47.2081206Z torch.manual_seed(2025) 2025-05-07T20:32:47.2081448Z 2025-05-07T20:32:47.2081726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.2082072Z 2025-05-07T20:32:47.2082272Z x_sign = torch.sign(x) 2025-05-07T20:32:47.2082562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.2082877Z x = x_sign * x_clamp 2025-05-07T20:32:47.2083123Z x0 = x[:, :D] 2025-05-07T20:32:47.2083340Z x1 = x[:, D:] 2025-05-07T20:32:47.2083562Z 2025-05-07T20:32:47.2083755Z if contiguous: 2025-05-07T20:32:47.2083990Z x0 = x0.contiguous() 2025-05-07T20:32:47.2084265Z x1 = x1.contiguous() 2025-05-07T20:32:47.2084513Z 2025-05-07T20:32:47.2084727Z if scale_ub is not None: 2025-05-07T20:32:47.2085035Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.2085375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.2085742Z ) 2025-05-07T20:32:47.2085944Z else: 2025-05-07T20:32:47.2086163Z scale_ub_tensor = None 2025-05-07T20:32:47.2086415Z 2025-05-07T20:32:47.2086654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2086980Z op = silu_mul_quant 2025-05-07T20:32:47.2087233Z if compiled: 
2025-05-07T20:32:47.2087490Z op = torch.compile(op) 2025-05-07T20:32:47.2087792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2088071Z 2025-05-07T20:32:47.2088312Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.2088605Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.2088933Z 2025-05-07T20:32:47.2089179Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2089513Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.2089801Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.2090523Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.2090894Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.2091200Z 2025-05-07T20:32:47.2091409Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.2091603Z 2025-05-07T20:32:47.2091710Z moe/activation_test.py:126: 2025-05-07T20:32:47.2092006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2092343Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.2092679Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.2093472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.2094220Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.2094798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.2095593Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.2096288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.2097004Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.2097759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:47.2098511Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.2099237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.2099974Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.2100591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.2101112Z fn() 2025-05-07T20:32:47.2101632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.2102222Z self.fn.run( 2025-05-07T20:32:47.2102688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.2103214Z kernel = self.compile( 2025-05-07T20:32:47.2103757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.2104419Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.2104824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2105057Z 2025-05-07T20:32:47.2105266Z self = 2025-05-07T20:32:47.2106342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:47.2107814Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3255c1d3f0>} 2025-05-07T20:32:47.2109145Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.2110238Z context = 2025-05-07T20:32:47.2110522Z 2025-05-07T20:32:47.2110744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.2111268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.2111737Z module_map=module_map) 2025-05-07T20:32:47.2112107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.2112463Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.2112730Z E ^ 2025-05-07T20:32:47.2113201Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.2113646Z 2025-05-07T20:32:47.2114067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.2114614Z 2025-05-07T20:32:47.2114741Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.2115163Z self=, 2025-05-07T20:32:47.2115570Z T=16384, 2025-05-07T20:32:47.2115759Z D=7168, 2025-05-07T20:32:47.2115960Z scale_ub=1200.0, 2025-05-07T20:32:47.2116192Z contiguous=False, 2025-05-07T20:32:47.2116417Z compiled=False, 2025-05-07T20:32:47.2116626Z ) 2025-05-07T20:32:48.3986714Z self = 2025-05-07T20:32:48.3987517Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:48.3987870Z 2025-05-07T20:32:48.3987955Z @given( 2025-05-07T20:32:48.3988192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.3988509Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.3988810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.3989141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.3989470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.3989757Z ) 2025-05-07T20:32:48.3990390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.3990834Z def test_silu_mul_quant( 2025-05-07T20:32:48.3991069Z self, 2025-05-07T20:32:48.3991268Z T: int, 2025-05-07T20:32:48.3991472Z D: int, 2025-05-07T20:32:48.3991689Z scale_ub: Optional[float], 2025-05-07T20:32:48.3991969Z contiguous: bool, 2025-05-07T20:32:48.3992213Z compiled: bool, 2025-05-07T20:32:48.3992438Z ) -> None: 2025-05-07T20:32:48.3992657Z torch.manual_seed(2025) 2025-05-07T20:32:48.3992904Z 2025-05-07T20:32:48.3993184Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.3993520Z 2025-05-07T20:32:48.3993718Z x_sign = torch.sign(x) 2025-05-07T20:32:48.3994016Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.3994322Z x = x_sign * x_clamp 2025-05-07T20:32:48.3994573Z x0 = x[:, :D] 2025-05-07T20:32:48.3994796Z x1 = x[:, D:] 2025-05-07T20:32:48.3994999Z 2025-05-07T20:32:48.3995192Z if contiguous: 2025-05-07T20:32:48.3995428Z x0 = x0.contiguous() 2025-05-07T20:32:48.3995683Z x1 = x1.contiguous() 2025-05-07T20:32:48.3995923Z 2025-05-07T20:32:48.3996117Z if scale_ub is not None: 2025-05-07T20:32:48.3996518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.3996860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.3997171Z ) 2025-05-07T20:32:48.3997357Z else: 2025-05-07T20:32:48.3997572Z scale_ub_tensor = None 2025-05-07T20:32:48.3997828Z 2025-05-07T20:32:48.3998055Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:48.3998365Z op = silu_mul_quant 2025-05-07T20:32:48.3998619Z if compiled: 2025-05-07T20:32:48.3998874Z op = torch.compile(op) 2025-05-07T20:32:48.3999261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.3999536Z 2025-05-07T20:32:48.3999806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.3999975Z 2025-05-07T20:32:48.4000075Z moe/activation_test.py:117: 2025-05-07T20:32:48.4000376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.4000708Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.4000992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.4001684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.4002375Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.4002910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.4003586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.4004249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.4004787Z kernel = self.compile( 2025-05-07T20:32:48.4005325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.4005981Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.4006440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.4006671Z 2025-05-07T20:32:48.4006891Z self = 2025-05-07T20:32:48.4007983Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.4009361Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3255c1ce50>} 2025-05-07T20:32:48.4010705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.4011715Z context = 2025-05-07T20:32:48.4012010Z 2025-05-07T20:32:48.4012180Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.4012706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.4013175Z module_map=module_map) 2025-05-07T20:32:48.4013536Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.4013887Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.4014152Z E ^ 2025-05-07T20:32:48.4014610Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.4015065Z 
2025-05-07T20:32:48.4015489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.4016004Z 
The next nine Hypothesis examples fail with the identical triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the failing call site varies with the compiled flag: with compiled=False, fn() fails at moe/activation_test.py:117 while compiling _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80); with compiled=True, fn() completes under torch.compile and the reference path ref_fn() fails at moe/activation_test.py:126 while compiling _kernel_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370, via triton_quantize_fp8_row):

2025-05-07T20:32:48.4016107Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn()
2025-05-07T20:32:48.4056121Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn()
2025-05-07T20:32:49.9590796Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn()
2025-05-07T20:32:49.9622283Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in ref_fn()
2025-05-07T20:32:50.0289166Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn()
2025-05-07T20:32:50.3940592Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in fn()
2025-05-07T20:32:50.3971620Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn()
2025-05-07T20:32:50.9738890Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn()
2025-05-07T20:32:51.5098735Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn()
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254f24280>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[report identical to the T=128 example above: ref_fn() raises the same CompilationError from _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:32:53.178000 88454 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
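Note on the recompile_limit warning above: torch.compile specializes silu_mul_quant on input shapes and strides, so this sweep over (T, D, contiguous) compiles a new variant per example until torch._dynamo hits its limit of 8 and falls back. A minimal sketch of two ways to keep a parameter sweep like this under the limit; this is an illustration under that assumption, not code from the test suite:

    import torch

    def compile_fresh(op):
        # Drop all cached graphs so each example compiles from scratch
        # instead of counting against the per-code-object recompile limit.
        torch._dynamo.reset()
        return torch.compile(op)

    def compile_dynamic(op):
        # Or compile with dynamic shapes so one graph serves many
        # shape/stride combinations rather than specializing per example.
        return torch.compile(op, dynamic=True)

Either helper would stand in for the bare torch.compile(op) call inside fn() in the test above.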
[report for T=16384 identical to the T=128 example above: ref_fn() raises the same CompilationError from _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[fn() raises the same CompilationError, here from _fbgemm_silu_mul_quant via fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[ref_fn() raises the same CompilationError from _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[fn() raises the same CompilationError from _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[fn() raises the same CompilationError from _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[fn() raises the same CompilationError from _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[fn() raises the same CompilationError from _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[fn() fails in _fbgemm_silu_mul_quant with:]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1716291Z 2025-05-07T20:32:54.1716706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.1717214Z 2025-05-07T20:32:54.1717329Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1717749Z self=, 2025-05-07T20:32:54.1718147Z T=1, 2025-05-07T20:32:54.1718341Z D=7168, 2025-05-07T20:32:54.1718540Z scale_ub=1200.0, 2025-05-07T20:32:54.1718758Z contiguous=True, 2025-05-07T20:32:54.1718984Z compiled=True, 2025-05-07T20:32:54.1719192Z ) 2025-05-07T20:32:54.1719510Z self = 2025-05-07T20:32:54.1720046Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.1720310Z 2025-05-07T20:32:54.1720394Z @given( 2025-05-07T20:32:54.1720627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1720944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1721257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1721591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1721920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1722214Z ) 2025-05-07T20:32:54.1722566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1723006Z def test_silu_mul_quant( 2025-05-07T20:32:54.1723257Z self, 2025-05-07T20:32:54.1723461Z T: int, 2025-05-07T20:32:54.1723660Z D: int, 2025-05-07T20:32:54.1723890Z scale_ub: Optional[float], 2025-05-07T20:32:54.1724170Z contiguous: bool, 2025-05-07T20:32:54.1724411Z compiled: bool, 2025-05-07T20:32:54.1730988Z ) -> None: 2025-05-07T20:32:54.1731234Z torch.manual_seed(2025) 2025-05-07T20:32:54.1731499Z 2025-05-07T20:32:54.1731791Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1732150Z 2025-05-07T20:32:54.1732360Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1732657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1732982Z x = x_sign * x_clamp 2025-05-07T20:32:54.1733249Z x0 = x[:, :D] 2025-05-07T20:32:54.1733481Z x1 = x[:, D:] 2025-05-07T20:32:54.1733697Z 2025-05-07T20:32:54.1733902Z if contiguous: 2025-05-07T20:32:54.1734150Z x0 = x0.contiguous() 2025-05-07T20:32:54.1734415Z x1 = x1.contiguous() 2025-05-07T20:32:54.1734669Z 2025-05-07T20:32:54.1734873Z if scale_ub is not None: 2025-05-07T20:32:54.1735155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1735591Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1735912Z ) 2025-05-07T20:32:54.1736113Z else: 2025-05-07T20:32:54.1736342Z scale_ub_tensor = None 2025-05-07T20:32:54.1736604Z 2025-05-07T20:32:54.1736844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1737172Z op = silu_mul_quant 2025-05-07T20:32:54.1737437Z if compiled: 2025-05-07T20:32:54.1737693Z op = torch.compile(op) 2025-05-07T20:32:54.1738056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1738340Z 2025-05-07T20:32:54.1738540Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.1738762Z 2025-05-07T20:32:54.1738870Z moe/activation_test.py:117: 2025-05-07T20:32:54.1739180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1739525Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.1739888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1740467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.1741045Z return fn(*args, **kwargs) 
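Every failure in this run has the same root cause: the Triton kernels request the fp8e4nv dtype (FP8 E4M3, torch.float8_e4m3fn), which Triton's NVIDIA backend only accepts on GPUs with compute capability 8.9 or newer. The A10G in a linux.g5.4xlarge runner reports compute capability 8.6, where only the fp8e4b15 and fp8e5 variants exist, so every kernel that touches E4M3 fails at compile time before the test logic runs. A minimal guard sketch (the helper name and decorator are hypothetical, not part of activation_test.py) that would skip such tests on unsupported hardware:

    import unittest

    import torch


    def gpu_supports_fp8_e4m3() -> bool:
        """True if Triton's fp8e4nv (torch.float8_e4m3fn) is usable on this GPU.

        NVIDIA support for E4M3 starts at compute capability (8, 9) (Ada/Hopper);
        the A10G on a g5.4xlarge reports (8, 6).
        """
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical usage: skip FP8 tests instead of failing at kernel compile time.
    fp8_test = unittest.skipUnless(
        gpu_supports_fp8_e4m3(), "FP8 E4M3 (fp8e4nv) not supported on this GPU"
    )

Gating on torch.cuda.get_device_capability() keeps the suite green on pre-Ada runners while still exercising the FP8 path on sm_89/sm_90-class machines.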
2025-05-07T20:32:54.1717329Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError in _fbgemm_silu_mul_quant; with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 before reaching silu_mul_quant
2025-05-07T20:32:54.1756346Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
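The error is independent of FBGEMM and reproducible with a few lines of Triton. A standalone repro sketch, assuming Triton's tl.float8e4nv dtype and torch.float8_e4m3fn are available (any cast to E4M3 inside a kernel is enough to trigger the architecture check):

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_fp8_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # On GPUs older than sm_89 this cast fails at compile time with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    N = 1024
    x = torch.randn(N, device="cuda", dtype=torch.float32)
    y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8_kernel[(triton.cdiv(N, 256),)](x, y, N, BLOCK=256)

On sm_89/sm_90 hardware the same kernel compiles and runs; on this runner the NVIDIA backend rejects the dtype during lowering, which is why it surfaces as a CompilationError pointing at the kernel's first line.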
2025-05-07T20:32:54.3170848Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.3171263Z     self=<...>,
2025-05-07T20:32:54.3171664Z     T=1,
2025-05-07T20:32:54.3171846Z     D=7168,
2025-05-07T20:32:54.3172042Z     scale_ub=None,
2025-05-07T20:32:54.3172260Z     contiguous=False,
2025-05-07T20:32:54.3172485Z     compiled=True,
2025-05-07T20:32:54.3172692Z )
2025-05-07T20:32:54.5778627Z self = <...>
2025-05-07T20:32:54.5779690Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:54.5780453Z     ... (test body as above; this time fn() returned successfully) ...
2025-05-07T20:32:54.5794530Z         y_fp8, y_scale = fn()
2025-05-07T20:32:54.5794817Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:54.5795367Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.5795704Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:54.5796091Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:54.5796417Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:54.5796782Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.5797298Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.5797605Z moe/activation_test.py:126:
2025-05-07T20:32:54.5797903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.5798246Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:54.5798585Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.5799374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:54.5800124Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:54.5800692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.5801381Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.5802065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:54.5802789Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.5803541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:54.5804296Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.5805021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:54.5805662Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:54.5806345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:54.5806868Z     fn()
2025-05-07T20:32:54.5807372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:54.5807963Z     self.fn.run(
2025-05-07T20:32:54.5808438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.5808975Z     kernel = self.compile(
2025-05-07T20:32:54.5809521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.5810315Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.5810727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.5816461Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.5817032Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.5817514Z                            module_map=module_map)
2025-05-07T20:32:54.5817884Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.5818248Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.5818524Z E   ^
2025-05-07T20:32:54.5818988Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.5819906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
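In the scale_ub=None, compiled=True example above, the op under test got past compilation and the failure moved into the reference path: triton_quantize_fp8_row (fp8_gemm.py:2370) launches _kernel_quantize_fp8_row through the autotuner and hits the same unsupported-dtype error. For debugging on hardware without E4M3 support, the row-wise quantization can be approximated in plain PyTorch; this is a sketch of the assumed semantics (per-row max-abs scaling into the float8_e4m3fn range, optionally capped by scale_ub), not FBGEMM's actual kernel:

    from typing import Optional, Tuple

    import torch


    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise FP8 quantization sketch: scale each row so its max |value|
        # lands at the float8_e4m3fn maximum (448.0). scale_ub, if given,
        # caps the per-row max before the scale is computed.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        scale = row_max.clamp(min=1e-12) / fp8_max
        y = (x.to(torch.float32) / scale[:, None]).clamp(-fp8_max, fp8_max)
        return y.to(torch.float8_e4m3fn), scale

A reference like this runs on any device PyTorch supports, so the numerics of silu_mul_quant can still be checked even where the Triton kernels cannot compile.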
2025-05-07T20:32:54.7526945Z op = torch.compile(op) 2025-05-07T20:32:54.7527255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.7527528Z 2025-05-07T20:32:54.7527724Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.7527892Z 2025-05-07T20:32:54.7528002Z moe/activation_test.py:117: 2025-05-07T20:32:54.7528413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.7528749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.7529034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.7529596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.7530147Z return fn(*args, **kwargs) 2025-05-07T20:32:54.7530804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.7531492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.7532026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.7532703Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.7533370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.7533905Z kernel = self.compile( 2025-05-07T20:32:54.7534442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.7535098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.7535490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.7535715Z 2025-05-07T20:32:54.7535931Z self = 2025-05-07T20:32:54.7536996Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.7538369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254553eb0>} 2025-05-07T20:32:54.7539828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.7540860Z context = 2025-05-07T20:32:54.7541144Z 2025-05-07T20:32:54.7541311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.7541831Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.7542350Z module_map=module_map) 2025-05-07T20:32:54.7542759Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.7543111Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.7543374Z E ^ 2025-05-07T20:32:54.7543846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.7544299Z 2025-05-07T20:32:54.7544716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.7545239Z 2025-05-07T20:32:54.7545347Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7545764Z self=, 2025-05-07T20:32:54.7546166Z T=1, 2025-05-07T20:32:54.7546352Z D=5120, 2025-05-07T20:32:54.7546554Z scale_ub=1200.0, 2025-05-07T20:32:54.7546781Z contiguous=False, 2025-05-07T20:32:54.7547011Z compiled=False, 2025-05-07T20:32:54.7547227Z ) 2025-05-07T20:32:54.7547549Z self = 2025-05-07T20:32:54.7548041Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.7548312Z 2025-05-07T20:32:54.7548390Z @given( 2025-05-07T20:32:54.7548623Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7548985Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7549293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7549628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7549957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7550236Z ) 2025-05-07T20:32:54.7550586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7551029Z def test_silu_mul_quant( 2025-05-07T20:32:54.7551266Z self, 2025-05-07T20:32:54.7551462Z T: int, 2025-05-07T20:32:54.7551670Z D: int, 2025-05-07T20:32:54.7551887Z scale_ub: Optional[float], 2025-05-07T20:32:54.7552163Z contiguous: bool, 2025-05-07T20:32:54.7552403Z compiled: bool, 2025-05-07T20:32:54.7552625Z ) -> None: 2025-05-07T20:32:54.7552841Z torch.manual_seed(2025) 2025-05-07T20:32:54.7553081Z 2025-05-07T20:32:54.7553352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7553702Z 2025-05-07T20:32:54.7553897Z x_sign = torch.sign(x) 2025-05-07T20:32:54.7554190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.7554497Z x = x_sign * x_clamp 2025-05-07T20:32:54.7554741Z x0 = x[:, :D] 2025-05-07T20:32:54.7554962Z x1 = x[:, D:] 2025-05-07T20:32:54.7555164Z 2025-05-07T20:32:54.7555358Z if contiguous: 2025-05-07T20:32:54.7555593Z x0 = x0.contiguous() 2025-05-07T20:32:54.7555851Z x1 = x1.contiguous() 2025-05-07T20:32:54.7556095Z 2025-05-07T20:32:54.7556289Z if scale_ub is not None: 2025-05-07T20:32:54.7556561Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.7556900Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.7557208Z ) 2025-05-07T20:32:54.7557398Z else: 2025-05-07T20:32:54.7557618Z scale_ub_tensor = None 2025-05-07T20:32:54.7557872Z 2025-05-07T20:32:54.7558157Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.7565039Z op = silu_mul_quant 2025-05-07T20:32:54.7565323Z if compiled: 2025-05-07T20:32:54.7565593Z op = torch.compile(op) 2025-05-07T20:32:54.7565897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.7566208Z 2025-05-07T20:32:54.7566417Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.7566588Z 2025-05-07T20:32:54.7566701Z moe/activation_test.py:117: 2025-05-07T20:32:54.7567003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.7567436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.7567774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.7568476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.7569169Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.7569723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.7570412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.7571084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.7571628Z kernel = self.compile( 2025-05-07T20:32:54.7572181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.7572849Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.7573249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.7573489Z 2025-05-07T20:32:54.7573698Z self = 2025-05-07T20:32:54.7574827Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.7576196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254cf0940>} 2025-05-07T20:32:54.7577550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.7578592Z context = 2025-05-07T20:32:54.7578889Z 2025-05-07T20:32:54.7579060Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.7579599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.7580165Z module_map=module_map) 2025-05-07T20:32:54.7580550Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.7580914Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.7581186Z E ^ 2025-05-07T20:32:54.7581658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.7582117Z 2025-05-07T20:32:54.7582540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.7583053Z 2025-05-07T20:32:54.7583170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7583592Z self=, 2025-05-07T20:32:54.7584007Z T=16384, 2025-05-07T20:32:54.7584216Z D=5120, 2025-05-07T20:32:54.7584424Z scale_ub=1200.0, 2025-05-07T20:32:54.7584652Z contiguous=False, 2025-05-07T20:32:54.7584884Z compiled=True, 2025-05-07T20:32:54.7585099Z ) 2025-05-07T20:32:54.8586118Z self = 2025-05-07T20:32:54.8587060Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.8587344Z 2025-05-07T20:32:54.8587422Z @given( 2025-05-07T20:32:54.8587665Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8587988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8588294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8588633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8589231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8589522Z ) 2025-05-07T20:32:54.8590243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8590697Z def test_silu_mul_quant( 2025-05-07T20:32:54.8590949Z self, 2025-05-07T20:32:54.8591144Z T: int, 2025-05-07T20:32:54.8591350Z D: int, 2025-05-07T20:32:54.8591577Z scale_ub: Optional[float], 2025-05-07T20:32:54.8591866Z contiguous: bool, 2025-05-07T20:32:54.8592121Z compiled: bool, 2025-05-07T20:32:54.8592357Z ) -> None: 2025-05-07T20:32:54.8592575Z torch.manual_seed(2025) 2025-05-07T20:32:54.8592827Z 2025-05-07T20:32:54.8593111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8593453Z 2025-05-07T20:32:54.8593654Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8593951Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8594263Z x = x_sign * x_clamp 2025-05-07T20:32:54.8594517Z x0 = x[:, :D] 2025-05-07T20:32:54.8594745Z x1 = x[:, D:] 2025-05-07T20:32:54.8594951Z 2025-05-07T20:32:54.8595156Z if contiguous: 2025-05-07T20:32:54.8595389Z x0 = x0.contiguous() 2025-05-07T20:32:54.8595651Z x1 = x1.contiguous() 2025-05-07T20:32:54.8595894Z 2025-05-07T20:32:54.8596086Z if scale_ub is not None: 2025-05-07T20:32:54.8596455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8596803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8597117Z ) 2025-05-07T20:32:54.8597308Z else: 2025-05-07T20:32:54.8597529Z scale_ub_tensor = None 2025-05-07T20:32:54.8597780Z 2025-05-07T20:32:54.8598014Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8598329Z op = silu_mul_quant 2025-05-07T20:32:54.8598593Z if compiled: 2025-05-07T20:32:54.8598842Z op = torch.compile(op) 2025-05-07T20:32:54.8599145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8599424Z 2025-05-07T20:32:54.8599618Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8599790Z 2025-05-07T20:32:54.8599892Z moe/activation_test.py:117: 2025-05-07T20:32:54.8600192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8600524Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8600820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8601386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8601951Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8602614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8603301Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8603840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8604517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8605184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8605712Z kernel = self.compile( 2025-05-07T20:32:54.8606259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8606993Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8607393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8607620Z 2025-05-07T20:32:54.8607830Z self = 2025-05-07T20:32:54.8608904Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8610430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337c88b0>} 2025-05-07T20:32:54.8611765Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8612788Z context = 2025-05-07T20:32:54.8613073Z 2025-05-07T20:32:54.8613248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8613762Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8614230Z module_map=module_map) 2025-05-07T20:32:54.8614601Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8614956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8615214Z E ^ 2025-05-07T20:32:54.8615685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8616130Z 2025-05-07T20:32:54.8616594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8617111Z 2025-05-07T20:32:54.8617221Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8617630Z self=, 2025-05-07T20:32:54.8618034Z T=2048, 2025-05-07T20:32:54.8618228Z D=7168, 2025-05-07T20:32:54.8618419Z scale_ub=1200.0, 2025-05-07T20:32:54.8618651Z contiguous=False, 2025-05-07T20:32:54.8618880Z compiled=True, 2025-05-07T20:32:54.8619082Z ) 2025-05-07T20:32:54.8619404Z self = 2025-05-07T20:32:54.8619971Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.8620249Z 2025-05-07T20:32:54.8620329Z @given( 2025-05-07T20:32:54.8620565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8620885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8621196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8621526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8621865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8622154Z ) 2025-05-07T20:32:54.8622503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8622947Z def test_silu_mul_quant( 2025-05-07T20:32:54.8623191Z self, 2025-05-07T20:32:54.8623382Z T: int, 2025-05-07T20:32:54.8623586Z D: int, 2025-05-07T20:32:54.8623813Z scale_ub: Optional[float], 2025-05-07T20:32:54.8624088Z contiguous: bool, 2025-05-07T20:32:54.8624331Z compiled: bool, 2025-05-07T20:32:54.8624559Z ) -> None: 2025-05-07T20:32:54.8624772Z torch.manual_seed(2025) 2025-05-07T20:32:54.8625020Z 2025-05-07T20:32:54.8625295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8625640Z 2025-05-07T20:32:54.8625831Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8626188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8626498Z x = x_sign * x_clamp 2025-05-07T20:32:54.8626757Z x0 = x[:, :D] 2025-05-07T20:32:54.8627007Z x1 = x[:, D:] 2025-05-07T20:32:54.8627217Z 2025-05-07T20:32:54.8627402Z if contiguous: 2025-05-07T20:32:54.8627638Z x0 = x0.contiguous() 2025-05-07T20:32:54.8627905Z x1 = x1.contiguous() 2025-05-07T20:32:54.8628141Z 2025-05-07T20:32:54.8628340Z if scale_ub is not None: 2025-05-07T20:32:54.8628618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8628997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8629347Z ) 2025-05-07T20:32:54.8629545Z else: 2025-05-07T20:32:54.8629753Z scale_ub_tensor = None 2025-05-07T20:32:54.8630013Z 2025-05-07T20:32:54.8630251Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8630563Z op = silu_mul_quant 2025-05-07T20:32:54.8630822Z if compiled: 2025-05-07T20:32:54.8631075Z op = torch.compile(op) 2025-05-07T20:32:54.8631376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8631646Z 2025-05-07T20:32:54.8631843Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8632012Z 2025-05-07T20:32:54.8632119Z moe/activation_test.py:117: 2025-05-07T20:32:54.8632415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8632749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8633044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8633602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8634167Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8634824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8635515Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8636094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8636780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8637447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8637983Z kernel = self.compile( 2025-05-07T20:32:54.8638521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8639189Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8639589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8639816Z 2025-05-07T20:32:54.8640026Z self = 2025-05-07T20:32:54.8641100Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8642460Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337c9090>} 2025-05-07T20:32:54.8643811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8644855Z context = 2025-05-07T20:32:54.8645140Z 2025-05-07T20:32:54.8645314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8645832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8646349Z module_map=module_map) 2025-05-07T20:32:54.8646720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8647075Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8647341Z E ^ 2025-05-07T20:32:54.8647809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8648254Z 2025-05-07T20:32:54.8648675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8649239Z 2025-05-07T20:32:54.9940977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9941799Z self=, 2025-05-07T20:32:54.9942330Z T=1, 2025-05-07T20:32:54.9942525Z D=5120, 2025-05-07T20:32:54.9942727Z scale_ub=None, 2025-05-07T20:32:54.9942946Z contiguous=False, 2025-05-07T20:32:54.9943176Z compiled=False, 2025-05-07T20:32:54.9943397Z ) 2025-05-07T20:32:54.9943723Z self = 2025-05-07T20:32:54.9944216Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.9944483Z 2025-05-07T20:32:54.9944563Z @given( 2025-05-07T20:32:54.9944794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9945105Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9945418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9945760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9946096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9946387Z ) 2025-05-07T20:32:54.9946752Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9947190Z def test_silu_mul_quant( 2025-05-07T20:32:54.9947436Z self, 2025-05-07T20:32:54.9947639Z T: int, 2025-05-07T20:32:54.9947844Z D: int, 2025-05-07T20:32:54.9948156Z scale_ub: Optional[float], 2025-05-07T20:32:54.9948442Z contiguous: bool, 2025-05-07T20:32:54.9948695Z compiled: bool, 2025-05-07T20:32:54.9948918Z ) -> None: 2025-05-07T20:32:54.9949137Z torch.manual_seed(2025) 2025-05-07T20:32:54.9949383Z 2025-05-07T20:32:54.9949656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9949999Z 2025-05-07T20:32:54.9950200Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9950496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9950813Z x = x_sign * x_clamp 2025-05-07T20:32:54.9951058Z x0 = x[:, :D] 2025-05-07T20:32:54.9951278Z x1 = x[:, D:] 2025-05-07T20:32:54.9951494Z 2025-05-07T20:32:54.9951685Z if contiguous: 2025-05-07T20:32:54.9951917Z x0 = x0.contiguous() 2025-05-07T20:32:54.9952184Z x1 = x1.contiguous() 2025-05-07T20:32:54.9952435Z 2025-05-07T20:32:54.9952643Z if scale_ub is not None: 2025-05-07T20:32:54.9952925Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9953273Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9953592Z ) 2025-05-07T20:32:54.9953784Z else: 2025-05-07T20:32:54.9954011Z scale_ub_tensor = None 2025-05-07T20:32:54.9954272Z 2025-05-07T20:32:54.9954507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9954839Z op = silu_mul_quant 2025-05-07T20:32:54.9955100Z if compiled: 2025-05-07T20:32:54.9955358Z op = torch.compile(op) 2025-05-07T20:32:54.9955670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9955960Z 2025-05-07T20:32:54.9956155Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9956328Z 2025-05-07T20:32:54.9956435Z moe/activation_test.py:117: 2025-05-07T20:32:54.9956751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9957220Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9957514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9958206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9958898Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9959431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9960109Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9960897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9961440Z kernel = self.compile( 2025-05-07T20:32:54.9961979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9962632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9963037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9963265Z 2025-05-07T20:32:54.9963473Z self = 2025-05-07T20:32:54.9964546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9965923Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337c97e0>} 2025-05-07T20:32:54.9967253Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9968318Z context = 2025-05-07T20:32:54.9968605Z 2025-05-07T20:32:54.9968770Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9969294Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9969764Z module_map=module_map) 2025-05-07T20:32:54.9970135Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9970489Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9970760Z E ^ 2025-05-07T20:32:54.9971235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9971679Z 2025-05-07T20:32:54.9972092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9972605Z 2025-05-07T20:32:54.9972713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9973133Z self=, 2025-05-07T20:32:54.9973531Z T=4096, 2025-05-07T20:32:54.9973719Z D=7168, 2025-05-07T20:32:54.9973921Z scale_ub=1200.0, 2025-05-07T20:32:54.9974154Z contiguous=False, 2025-05-07T20:32:54.9974378Z compiled=False, 2025-05-07T20:32:54.9974587Z ) 2025-05-07T20:32:54.9974914Z self = 2025-05-07T20:32:54.9975404Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.9975696Z 2025-05-07T20:32:54.9975773Z @given( 2025-05-07T20:32:54.9976012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9976323Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9976636Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9976968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9977301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9977641Z ) 2025-05-07T20:32:54.9977996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9978440Z def test_silu_mul_quant( 2025-05-07T20:32:54.9978684Z self, 2025-05-07T20:32:54.9978887Z T: int, 2025-05-07T20:32:54.9979087Z D: int, 2025-05-07T20:32:54.9979310Z scale_ub: Optional[float], 2025-05-07T20:32:54.9979586Z contiguous: bool, 2025-05-07T20:32:54.9979948Z compiled: bool, 2025-05-07T20:32:54.9980182Z ) -> None: 2025-05-07T20:32:54.9980502Z torch.manual_seed(2025) 2025-05-07T20:32:54.9980759Z 2025-05-07T20:32:54.9981082Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9981446Z 2025-05-07T20:32:54.9981651Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9981948Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9982270Z x = x_sign * x_clamp 2025-05-07T20:32:54.9982530Z x0 = x[:, :D] 2025-05-07T20:32:54.9982762Z x1 = x[:, D:] 2025-05-07T20:32:54.9982976Z 2025-05-07T20:32:54.9983172Z if contiguous: 2025-05-07T20:32:54.9983418Z x0 = x0.contiguous() 2025-05-07T20:32:54.9983684Z x1 = x1.contiguous() 2025-05-07T20:32:54.9983932Z 2025-05-07T20:32:54.9984137Z if scale_ub is not None: 2025-05-07T20:32:54.9984416Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9984757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9985079Z ) 2025-05-07T20:32:54.9985280Z else: 2025-05-07T20:32:54.9985528Z scale_ub_tensor = None 2025-05-07T20:32:54.9985795Z 2025-05-07T20:32:54.9986038Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9986356Z op = silu_mul_quant 2025-05-07T20:32:54.9986623Z if compiled: 2025-05-07T20:32:54.9986910Z op = torch.compile(op) 2025-05-07T20:32:54.9987292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9987585Z 2025-05-07T20:32:54.9987799Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9987969Z 2025-05-07T20:32:54.9988076Z moe/activation_test.py:117: 2025-05-07T20:32:54.9988384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9988725Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9989018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9989711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.9990848Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9991405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9992078Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9992747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9993287Z kernel = self.compile( 2025-05-07T20:32:54.9993835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9994486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9994886Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9995114Z 2025-05-07T20:32:54.9995329Z self = 2025-05-07T20:32:54.9996403Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9997773Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337ca200>} 2025-05-07T20:32:54.9999217Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.0000243Z context = 2025-05-07T20:32:55.0000534Z 2025-05-07T20:32:55.0000710Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.0001309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.0001844Z module_map=module_map) 2025-05-07T20:32:55.0002221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.0002582Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.0002845Z E ^ 2025-05-07T20:32:55.0003320Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.0003777Z 2025-05-07T20:32:55.0004201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.0004715Z 2025-05-07T20:32:55.0004828Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.0005243Z self=, 2025-05-07T20:32:55.0005654Z T=16384, 2025-05-07T20:32:55.0005858Z D=7168, 2025-05-07T20:32:55.0006064Z scale_ub=None, 2025-05-07T20:32:55.0006290Z contiguous=True, 2025-05-07T20:32:55.0006520Z compiled=True, 2025-05-07T20:32:55.0006725Z ) 2025-05-07T20:32:55.1947980Z self = 2025-05-07T20:32:55.1948628Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.1949017Z 2025-05-07T20:32:55.1949132Z @given( 2025-05-07T20:32:55.1949715Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1950132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1950460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1950788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1951127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1951420Z ) 2025-05-07T20:32:55.1951772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1952222Z def test_silu_mul_quant( 2025-05-07T20:32:55.1952480Z self, 2025-05-07T20:32:55.1952676Z T: int, 2025-05-07T20:32:55.1952883Z D: int, 2025-05-07T20:32:55.1953115Z scale_ub: Optional[float], 2025-05-07T20:32:55.1953386Z contiguous: bool, 2025-05-07T20:32:55.1953641Z compiled: bool, 2025-05-07T20:32:55.1953877Z ) -> None: 2025-05-07T20:32:55.1954103Z torch.manual_seed(2025) 2025-05-07T20:32:55.1954351Z 2025-05-07T20:32:55.1954639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1954987Z 2025-05-07T20:32:55.1955180Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1955479Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1955798Z x = x_sign * x_clamp 2025-05-07T20:32:55.1956042Z x0 = x[:, :D] 2025-05-07T20:32:55.1956268Z x1 = x[:, D:] 2025-05-07T20:32:55.1956486Z 2025-05-07T20:32:55.1956677Z if contiguous: 2025-05-07T20:32:55.1956918Z x0 = x0.contiguous() 2025-05-07T20:32:55.1957187Z x1 = x1.contiguous() 2025-05-07T20:32:55.1957423Z 2025-05-07T20:32:55.1957625Z if scale_ub is not None: 2025-05-07T20:32:55.1957919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1958258Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1958569Z ) 2025-05-07T20:32:55.1958774Z else: 2025-05-07T20:32:55.1959095Z scale_ub_tensor = None 2025-05-07T20:32:55.1959350Z 2025-05-07T20:32:55.1959594Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1959912Z op = silu_mul_quant 2025-05-07T20:32:55.1960161Z if compiled: 2025-05-07T20:32:55.1960418Z op = torch.compile(op) 2025-05-07T20:32:55.1960719Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1960989Z 2025-05-07T20:32:55.1961197Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.1961363Z 2025-05-07T20:32:55.1961471Z moe/activation_test.py:117: 2025-05-07T20:32:55.1961858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1962268Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.1962560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1963125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.1963689Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.1964356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.1965047Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.1965577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1966257Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1966923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1967466Z kernel = self.compile( 2025-05-07T20:32:55.1968012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1968669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1969071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1969346Z 2025-05-07T20:32:55.1969566Z self = 2025-05-07T20:32:55.1970634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.1972015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31337cb760>} 2025-05-07T20:32:55.1973372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1974399Z context = 2025-05-07T20:32:55.1974684Z 2025-05-07T20:32:55.1974858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1975383Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1975857Z module_map=module_map) 2025-05-07T20:32:55.1976223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1976580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.1976868Z E ^ 2025-05-07T20:32:55.1977365Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.1978238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

(The following Hypothesis examples repeat the identical test body and traceback verbatim and fail with the same CompilationError; only the sampled parameters differ, so the duplicated blocks are collapsed to their parameters.)

2025-05-07T20:32:55.1978858Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:55.5333197Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:55.5365033Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:55.6682941Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)

Each example ends in:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
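This repeated error is architectural rather than flaky: fp8e4nv is Triton's FP8 E4M3 format, which the NVIDIA backend only lowers on GPUs with compute capability 8.9 or newer (Ada/Hopper); the A10G backing a linux.g5.4xlarge runner is SM 8.6, which is why only fp8e4b15 and fp8e5 are reported as supported. A minimal sketch of a capability guard that would skip such tests instead of failing them (the helper and class names below are illustrative assumptions, not FBGEMM's actual test code):

    import unittest

    import torch

    def gpu_supports_fp8_e4m3() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on compute capability >= 8.9,
        # e.g. Ada (SM 8.9) or Hopper (SM 9.0). On an A10G (SM 8.6) this
        # returns False, so the test below would be skipped rather than raise
        # the CompilationError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8_e4m3(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class GuardedActivationTest(unittest.TestCase):
        # Hypothetical wrapper: the test_silu_mul_quant body from the log
        # above would live here unchanged.
        ...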
(The identical failure repeats for the following examples; the duplicated blocks are again collapsed to their parameters.)

2025-05-07T20:32:55.6723687Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.6754590Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:55.9420672Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:55.9451737Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.0482049Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.4039100Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

Each example ends in:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
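For reference on what is being compiled: silu_mul_quant returns a quantized tensor plus its scale, and fp8e4nv corresponds to PyTorch's torch.float8_e4m3fn. A plausible eager-mode reference, inferred from the test body above (an assumption for illustration, not FBGEMM's actual implementation):

    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: SiLU-gate x0, multiply by x1, then quantize each
        # row to FP8 E4M3 with a per-row scale, optionally clamping the row
        # max by scale_ub. Inferred from the test; not the Triton kernel.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale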
2025-05-07T20:32:56.4074555Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:56.4105428Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.5982231Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.6016567Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.6018382Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.7082144Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.7083636Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:56.7115078Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.0801920Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:57.0833160Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.0834736Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:57.2015129Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.2016597Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:57.2046825Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3420397Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:57.3456759Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3458299Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:57.3487416Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.5397819Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:57.5428725Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.5430227Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:57.6492962Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6493411Z 2025-05-07T20:32:57.6493825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6494334Z 2025-05-07T20:32:57.6494445Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.6494851Z self=, 2025-05-07T20:32:57.6495248Z T=2048, 2025-05-07T20:32:57.6495435Z D=7168, 2025-05-07T20:32:57.6495623Z scale_ub=None, 2025-05-07T20:32:57.6495926Z contiguous=True, 2025-05-07T20:32:57.6496166Z compiled=True, 2025-05-07T20:32:57.6496373Z ) 2025-05-07T20:32:57.6496694Z self = 2025-05-07T20:32:57.6497182Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:57.6497445Z 2025-05-07T20:32:57.6497523Z @given( 2025-05-07T20:32:57.6497750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6498061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6498366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6498765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6499150Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6499437Z ) 2025-05-07T20:32:57.6499878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6500317Z def test_silu_mul_quant( 2025-05-07T20:32:57.6500558Z self, 2025-05-07T20:32:57.6500759Z T: int, 2025-05-07T20:32:57.6500956Z D: int, 2025-05-07T20:32:57.6501178Z scale_ub: Optional[float], 2025-05-07T20:32:57.6501449Z contiguous: bool, 2025-05-07T20:32:57.6501687Z compiled: bool, 2025-05-07T20:32:57.6501913Z ) -> None: 2025-05-07T20:32:57.6502136Z torch.manual_seed(2025) 2025-05-07T20:32:57.6502373Z 2025-05-07T20:32:57.6502646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6502988Z 2025-05-07T20:32:57.6503177Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6503474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6503785Z x = x_sign * x_clamp 2025-05-07T20:32:57.6504031Z x0 = x[:, :D] 2025-05-07T20:32:57.6504249Z x1 = x[:, D:] 2025-05-07T20:32:57.6504457Z 2025-05-07T20:32:57.6504638Z if contiguous: 2025-05-07T20:32:57.6504874Z x0 = x0.contiguous() 2025-05-07T20:32:57.6505137Z x1 = x1.contiguous() 2025-05-07T20:32:57.6505437Z 2025-05-07T20:32:57.6505645Z if scale_ub is not None: 2025-05-07T20:32:57.6505926Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6511799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6512133Z ) 2025-05-07T20:32:57.6512345Z else: 2025-05-07T20:32:57.6512562Z scale_ub_tensor = None 2025-05-07T20:32:57.6512835Z 2025-05-07T20:32:57.6513083Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6513404Z op = silu_mul_quant 2025-05-07T20:32:57.6513686Z if compiled: 2025-05-07T20:32:57.6513955Z op = torch.compile(op) 2025-05-07T20:32:57.6514265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6514551Z 2025-05-07T20:32:57.6514755Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.6514931Z 2025-05-07T20:32:57.6515036Z moe/activation_test.py:117: 2025-05-07T20:32:57.6515356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6515697Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.6515988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6516550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.6517118Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.6517788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.6518483Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.6519031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.6519722Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.6520394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.6521112Z kernel = self.compile( 2025-05-07T20:32:57.6521759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.6522554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.6523010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6523287Z 2025-05-07T20:32:57.6523519Z self = 2025-05-07T20:32:57.6524897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.6526320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9e560>} 2025-05-07T20:32:57.6527693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.6528713Z context = 2025-05-07T20:32:57.6529007Z 2025-05-07T20:32:57.6529176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.6529709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.6530193Z module_map=module_map) 2025-05-07T20:32:57.6530566Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.6530932Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.6531199Z E ^ 2025-05-07T20:32:57.6531669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6532125Z 2025-05-07T20:32:57.6532589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6533114Z 2025-05-07T20:32:57.7326668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7327454Z self=, 2025-05-07T20:32:57.7328172Z T=16384, 2025-05-07T20:32:57.7328372Z D=5120, 2025-05-07T20:32:57.7328577Z scale_ub=None, 2025-05-07T20:32:57.7328809Z contiguous=False, 2025-05-07T20:32:57.7329044Z compiled=False, 2025-05-07T20:32:57.7329256Z ) 2025-05-07T20:32:57.7329590Z self = 2025-05-07T20:32:57.7330091Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.7330377Z 2025-05-07T20:32:57.7330455Z @given( 2025-05-07T20:32:57.7330699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7331025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7331335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7331673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7332006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7332296Z ) 2025-05-07T20:32:57.7332653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7333097Z def test_silu_mul_quant( 2025-05-07T20:32:57.7333342Z self, 2025-05-07T20:32:57.7333548Z T: int, 2025-05-07T20:32:57.7333752Z D: int, 2025-05-07T20:32:57.7333974Z scale_ub: Optional[float], 2025-05-07T20:32:57.7334254Z contiguous: bool, 2025-05-07T20:32:57.7334504Z compiled: bool, 2025-05-07T20:32:57.7334728Z ) -> None: 2025-05-07T20:32:57.7334962Z torch.manual_seed(2025) 2025-05-07T20:32:57.7335208Z 2025-05-07T20:32:57.7335490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7335947Z 2025-05-07T20:32:57.7336146Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7336445Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7338515Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
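
The repeated CompilationError above is Triton rejecting the fp8e4nv dtype (NVIDIA's FP8 E4M3 format) on this GPU: fp8e4nv requires compute capability 8.9 or newer (Ada/Hopper), while the A10G in a g5.4xlarge instance is sm_86, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability check that such a test could use to skip FP8 E4M3 cases on older parts; the helper name is illustrative, and skipping (rather than falling back to another dtype) is an assumption:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (E4M3) needs sm_89 or newer; the A10G on this runner is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage inside a test body:
    if not supports_fp8e4nv():
        raise unittest.SkipTest("FP8 E4M3 is not supported on this GPU")
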
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:57.7340534Z 2025-05-07T20:32:57.7340656Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:57.7340876Z 2025-05-07T20:32:57.7340983Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7341412Z self=, 2025-05-07T20:32:57.7341814Z T=4096, 2025-05-07T20:32:57.7342008Z D=7168, 2025-05-07T20:32:57.7342206Z scale_ub=1200.0, 2025-05-07T20:32:57.7342433Z contiguous=True, 2025-05-07T20:32:57.7342667Z compiled=True, 2025-05-07T20:32:57.7342874Z ) 2025-05-07T20:32:57.7343189Z self = 2025-05-07T20:32:57.7343694Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:57.7343965Z 2025-05-07T20:32:57.7344050Z @given( 2025-05-07T20:32:57.7344280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7344597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7344908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7345242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7345570Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7345860Z ) 2025-05-07T20:32:57.7346287Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7346729Z def test_silu_mul_quant( 2025-05-07T20:32:57.7346973Z self, 2025-05-07T20:32:57.7347170Z T: int, 2025-05-07T20:32:57.7347367Z D: int, 2025-05-07T20:32:57.7347591Z scale_ub: Optional[float], 2025-05-07T20:32:57.7347876Z contiguous: bool, 2025-05-07T20:32:57.7348115Z compiled: bool, 2025-05-07T20:32:57.7348338Z ) -> None: 2025-05-07T20:32:57.7348567Z torch.manual_seed(2025) 2025-05-07T20:32:57.7348838Z 2025-05-07T20:32:57.7349111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7349458Z 2025-05-07T20:32:57.7349655Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7349951Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7351943Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:57.7353801Z 2025-05-07T20:32:57.7353925Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:57.7354144Z 2025-05-07T20:32:57.7354253Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7354673Z self=, 2025-05-07T20:32:57.7355069Z T=16384, 2025-05-07T20:32:57.7355268Z D=7168, 2025-05-07T20:32:57.7355466Z scale_ub=None, 2025-05-07T20:32:57.7355684Z contiguous=False, 2025-05-07T20:32:57.7355958Z compiled=False, 2025-05-07T20:32:57.7356169Z ) 2025-05-07T20:32:57.7356487Z self = 2025-05-07T20:32:57.7356978Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.7357257Z 2025-05-07T20:32:57.7357337Z @given( 2025-05-07T20:32:57.7357565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7357875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7358188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7358576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7358902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7359230Z ) 2025-05-07T20:32:57.7359585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7360031Z def test_silu_mul_quant( 2025-05-07T20:32:57.7360272Z self, 2025-05-07T20:32:57.7360470Z T: int, 2025-05-07T20:32:57.7360676Z D: int, 2025-05-07T20:32:57.7360896Z scale_ub: Optional[float], 2025-05-07T20:32:57.7361174Z contiguous: bool, 2025-05-07T20:32:57.7361416Z compiled: bool, 2025-05-07T20:32:57.7361638Z ) -> None: 2025-05-07T20:32:57.7361862Z torch.manual_seed(2025) 2025-05-07T20:32:57.7362105Z 2025-05-07T20:32:57.7362374Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7364412Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:57.7366349Z 2025-05-07T20:32:57.7366474Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:57.7366690Z 2025-05-07T20:32:57.7366795Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7367209Z self=, 2025-05-07T20:32:57.7367617Z T=2048, 2025-05-07T20:32:57.7367836Z D=7168, 2025-05-07T20:32:57.7368043Z scale_ub=1200.0, 2025-05-07T20:32:57.7368272Z contiguous=True, 2025-05-07T20:32:57.7368497Z compiled=True, 2025-05-07T20:32:57.7368706Z ) 2025-05-07T20:32:57.7369020Z self = 2025-05-07T20:32:57.7369513Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:57.7369787Z 2025-05-07T20:32:57.7369866Z @given( 2025-05-07T20:32:57.7370101Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7370407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7370726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7371061Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7371389Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7371677Z ) 2025-05-07T20:32:57.7372030Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7372471Z def test_silu_mul_quant( 2025-05-07T20:32:57.7372719Z self, 2025-05-07T20:32:57.7372917Z T: int, 2025-05-07T20:32:57.7373116Z D: int, 2025-05-07T20:32:57.7373341Z scale_ub: Optional[float], 2025-05-07T20:32:57.7373613Z contiguous: bool, 2025-05-07T20:32:57.7373853Z compiled: bool, 2025-05-07T20:32:57.7374079Z ) -> None: 2025-05-07T20:32:57.7374298Z torch.manual_seed(2025) 2025-05-07T20:32:57.7374539Z 2025-05-07T20:32:57.7374811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7375197Z 2025-05-07T20:32:57.7375398Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7375691Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7377707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:57.7379602Z 2025-05-07T20:32:57.7379727Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:57.7380005Z 2025-05-07T20:32:57.7380116Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7380532Z self=, 2025-05-07T20:32:57.7380927Z T=2048, 2025-05-07T20:32:57.7381119Z D=7168, 2025-05-07T20:32:57.7381315Z scale_ub=None, 2025-05-07T20:32:57.7381528Z contiguous=True, 2025-05-07T20:32:57.7381761Z compiled=False, 2025-05-07T20:32:57.7381965Z ) 2025-05-07T20:32:58.0442526Z self = 2025-05-07T20:32:58.0443076Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.0443355Z 2025-05-07T20:32:58.0443441Z @given( 2025-05-07T20:32:58.0443668Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.0443988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.0444292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.0444621Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.0444947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.0445236Z ) 2025-05-07T20:32:58.0445694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.0446149Z def test_silu_mul_quant( 2025-05-07T20:32:58.0446408Z self, 2025-05-07T20:32:58.0446611Z T: int, 2025-05-07T20:32:58.0446811Z D: int, 2025-05-07T20:32:58.0447032Z scale_ub: Optional[float], 2025-05-07T20:32:58.0447302Z contiguous: bool, 2025-05-07T20:32:58.0447545Z compiled: bool, 2025-05-07T20:32:58.0447776Z ) -> None: 2025-05-07T20:32:58.0447993Z torch.manual_seed(2025) 2025-05-07T20:32:58.0448241Z 2025-05-07T20:32:58.0448518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.0448860Z 2025-05-07T20:32:58.0449057Z > x_sign = torch.sign(x) 2025-05-07T20:32:58.0450987Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
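
The OutOfMemoryError sizes reported in these examples follow directly from the test's input shape: x is a [T, 2*D] bfloat16 tensor, so each full-sized buffer costs T * 2*D * 2 bytes, and x_sign, x_clamp, and the x_sign * x_clamp product each allocate another buffer of the same size. A quick check against the T=16384, D=7168 draw above, which matches the 448.00 MiB figure in the allocator message:

    T, D = 16384, 7168
    bytes_per_elem = 2  # torch.bfloat16
    size_mib = T * (2 * D) * bytes_per_elem / 2**20
    print(f"{size_mib:.2f} MiB")  # 448.00 MiB, as reported by the allocator
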
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.0452822Z 2025-05-07T20:32:58.0452946Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:58.0453160Z 2025-05-07T20:32:58.0453266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.0453680Z self=, 2025-05-07T20:32:58.0454090Z T=1, 2025-05-07T20:32:58.0454281Z D=7168, 2025-05-07T20:32:58.0454479Z scale_ub=1200.0, 2025-05-07T20:32:58.0454710Z contiguous=True, 2025-05-07T20:32:58.0454940Z compiled=False, 2025-05-07T20:32:58.0455146Z ) 2025-05-07T20:32:58.0455466Z self = 2025-05-07T20:32:58.0456026Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.0456290Z 2025-05-07T20:32:58.0456372Z @given( 2025-05-07T20:32:58.0456605Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.0456916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.0457220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.0457550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.0457910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.0458291Z ) 2025-05-07T20:32:58.0458720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.0459159Z def test_silu_mul_quant( 2025-05-07T20:32:58.0459405Z self, 2025-05-07T20:32:58.0459607Z T: int, 2025-05-07T20:32:58.0459886Z D: int, 2025-05-07T20:32:58.0460118Z scale_ub: Optional[float], 2025-05-07T20:32:58.0460401Z contiguous: bool, 2025-05-07T20:32:58.0460642Z compiled: bool, 2025-05-07T20:32:58.0460871Z ) -> None: 2025-05-07T20:32:58.0461092Z torch.manual_seed(2025) 2025-05-07T20:32:58.0461333Z 2025-05-07T20:32:58.0461612Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.0461955Z 2025-05-07T20:32:58.0462152Z x_sign = torch.sign(x) 2025-05-07T20:32:58.0462447Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.0462766Z x = x_sign * x_clamp 2025-05-07T20:32:58.0463013Z x0 = x[:, :D] 2025-05-07T20:32:58.0463238Z x1 = x[:, D:] 2025-05-07T20:32:58.0463451Z 2025-05-07T20:32:58.0463645Z if contiguous: 2025-05-07T20:32:58.0463878Z x0 = x0.contiguous() 2025-05-07T20:32:58.0464138Z x1 = x1.contiguous() 2025-05-07T20:32:58.0464379Z 2025-05-07T20:32:58.0464572Z if scale_ub is not None: 2025-05-07T20:32:58.0464844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.0465234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.0465545Z ) 2025-05-07T20:32:58.0465742Z else: 2025-05-07T20:32:58.0465957Z scale_ub_tensor = None 2025-05-07T20:32:58.0466203Z 2025-05-07T20:32:58.0466437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.0466751Z op = silu_mul_quant 2025-05-07T20:32:58.0467001Z if compiled: 2025-05-07T20:32:58.0467257Z op = torch.compile(op) 2025-05-07T20:32:58.0467563Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.0467848Z 2025-05-07T20:32:58.0468052Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.0468225Z 2025-05-07T20:32:58.0468328Z moe/activation_test.py:117: 2025-05-07T20:32:58.0468624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.0468953Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.0469240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.0469931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.0470618Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.0471154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.0471833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.0472491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.0473020Z kernel = self.compile( 2025-05-07T20:32:58.0473561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.0474213Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.0474605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.0474884Z 2025-05-07T20:32:58.0475094Z self = 2025-05-07T20:32:58.0476164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.0477535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5c4c0>} 2025-05-07T20:32:58.0478954Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.0479981Z context = 2025-05-07T20:32:58.0480274Z 2025-05-07T20:32:58.0480444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.0480965Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.0481438Z module_map=module_map) 2025-05-07T20:32:58.0481807Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.0482167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.0482432Z E ^ 2025-05-07T20:32:58.0482896Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.0483350Z 2025-05-07T20:32:58.0483769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.0484278Z 2025-05-07T20:32:58.0484385Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.0484799Z self=, 2025-05-07T20:32:58.0485239Z T=128, 2025-05-07T20:32:58.0485430Z D=5120, 2025-05-07T20:32:58.0485625Z scale_ub=None, 2025-05-07T20:32:58.0485838Z contiguous=True, 2025-05-07T20:32:58.0486076Z compiled=False, 2025-05-07T20:32:58.0486282Z ) 2025-05-07T20:32:58.1262534Z self = 2025-05-07T20:32:58.1263109Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.1263406Z 2025-05-07T20:32:58.1263489Z @given( 2025-05-07T20:32:58.1263729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.1264049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.1264358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.1264685Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.1265015Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.1265298Z ) 2025-05-07T20:32:58.1265645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.1268639Z def test_silu_mul_quant( 2025-05-07T20:32:58.1268878Z self, 2025-05-07T20:32:58.1269067Z T: int, 2025-05-07T20:32:58.1269263Z D: int, 2025-05-07T20:32:58.1269483Z scale_ub: Optional[float], 2025-05-07T20:32:58.1269744Z contiguous: bool, 2025-05-07T20:32:58.1269978Z compiled: bool, 2025-05-07T20:32:58.1270201Z ) -> None: 2025-05-07T20:32:58.1270411Z torch.manual_seed(2025) 2025-05-07T20:32:58.1270647Z 2025-05-07T20:32:58.1270917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.1271247Z 2025-05-07T20:32:58.1271433Z x_sign = torch.sign(x) 2025-05-07T20:32:58.1271720Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.1272026Z x = x_sign * x_clamp 2025-05-07T20:32:58.1272256Z x0 = x[:, :D] 2025-05-07T20:32:58.1272469Z x1 = x[:, D:] 2025-05-07T20:32:58.1272672Z 2025-05-07T20:32:58.1272852Z if contiguous: 2025-05-07T20:32:58.1273100Z x0 = x0.contiguous() 2025-05-07T20:32:58.1273346Z x1 = x1.contiguous() 2025-05-07T20:32:58.1273578Z 2025-05-07T20:32:58.1273766Z if scale_ub is not None: 2025-05-07T20:32:58.1274025Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.1274349Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.1274647Z ) 2025-05-07T20:32:58.1274831Z else: 2025-05-07T20:32:58.1275037Z scale_ub_tensor = None 2025-05-07T20:32:58.1275366Z 2025-05-07T20:32:58.1275591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.1275960Z op = silu_mul_quant 2025-05-07T20:32:58.1276211Z if compiled: 2025-05-07T20:32:58.1276448Z op = torch.compile(op) 2025-05-07T20:32:58.1276735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1277001Z 2025-05-07T20:32:58.1277190Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.1277355Z 2025-05-07T20:32:58.1277456Z moe/activation_test.py:117: 2025-05-07T20:32:58.1277772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1278120Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.1278392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1279073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.1279762Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.1280290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.1280962Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.1281611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.1282135Z kernel = self.compile( 2025-05-07T20:32:58.1282725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.1283380Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.1283765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1283990Z 2025-05-07T20:32:58.1284196Z self = 2025-05-07T20:32:58.1285258Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.1286623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5c940>} 2025-05-07T20:32:58.1287944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.1289058Z context = 2025-05-07T20:32:58.1289338Z 2025-05-07T20:32:58.1289500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.1290191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.1290658Z module_map=module_map) 2025-05-07T20:32:58.1291027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.1296930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.1297203Z E ^ 2025-05-07T20:32:58.1297678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.1298125Z 2025-05-07T20:32:58.1298560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.1299072Z 2025-05-07T20:32:58.1299177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.1299586Z self=, 2025-05-07T20:32:58.1300098Z T=128, 2025-05-07T20:32:58.1300294Z D=7168, 2025-05-07T20:32:58.1300489Z scale_ub=None, 2025-05-07T20:32:58.1300711Z contiguous=True, 2025-05-07T20:32:58.1300940Z compiled=False, 2025-05-07T20:32:58.1301264Z ) 2025-05-07T20:32:58.1301589Z self = 2025-05-07T20:32:58.1302145Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.1302417Z 2025-05-07T20:32:58.1302496Z @given( 2025-05-07T20:32:58.1302728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.1303046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.1303351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.1303693Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.1304030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.1304317Z ) 2025-05-07T20:32:58.1304670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.1305115Z def test_silu_mul_quant( 2025-05-07T20:32:58.1305365Z self, 2025-05-07T20:32:58.1305565Z T: int, 2025-05-07T20:32:58.1305775Z D: int, 2025-05-07T20:32:58.1306000Z scale_ub: Optional[float], 2025-05-07T20:32:58.1306272Z contiguous: bool, 2025-05-07T20:32:58.1306515Z compiled: bool, 2025-05-07T20:32:58.1306744Z ) -> None: 2025-05-07T20:32:58.1306960Z torch.manual_seed(2025) 2025-05-07T20:32:58.1307227Z 2025-05-07T20:32:58.1307504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.1307894Z 2025-05-07T20:32:58.1308158Z x_sign = torch.sign(x) 2025-05-07T20:32:58.1308467Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.1308778Z x = x_sign * x_clamp 2025-05-07T20:32:58.1309014Z x0 = x[:, :D] 2025-05-07T20:32:58.1309236Z x1 = x[:, D:] 2025-05-07T20:32:58.1309445Z 2025-05-07T20:32:58.1309633Z if contiguous: 2025-05-07T20:32:58.1309868Z x0 = x0.contiguous() 2025-05-07T20:32:58.1310126Z x1 = x1.contiguous() 2025-05-07T20:32:58.1310366Z 2025-05-07T20:32:58.1310561Z if scale_ub is not None: 2025-05-07T20:32:58.1310839Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.1311174Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.1311482Z ) 2025-05-07T20:32:58.1311677Z else: 2025-05-07T20:32:58.1311895Z scale_ub_tensor = None 2025-05-07T20:32:58.1312145Z 2025-05-07T20:32:58.1312390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.1312708Z op = silu_mul_quant 2025-05-07T20:32:58.1313054Z if compiled: 2025-05-07T20:32:58.1313306Z op = torch.compile(op) 2025-05-07T20:32:58.1313601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1313868Z 2025-05-07T20:32:58.1314065Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.1314237Z 2025-05-07T20:32:58.1314340Z moe/activation_test.py:117: 2025-05-07T20:32:58.1314637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1314967Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.1315250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1315942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.1316624Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.1317160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.1317846Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.1318515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.1319039Z kernel = self.compile( 2025-05-07T20:32:58.1319594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.1320245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.1320693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1320958Z 2025-05-07T20:32:58.1321172Z self = 2025-05-07T20:32:58.1322249Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.1323624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5d240>} 2025-05-07T20:32:58.1324952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.1325963Z context = 2025-05-07T20:32:58.1326249Z 2025-05-07T20:32:58.1326423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.1326940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.1327407Z module_map=module_map) 2025-05-07T20:32:58.1327823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.1328183Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.1328447Z E ^ 2025-05-07T20:32:58.1328918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.1329360Z 2025-05-07T20:32:58.1329777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.1330285Z 2025-05-07T20:32:58.1330388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.1330801Z self=, 2025-05-07T20:32:58.1331198Z T=2048, 2025-05-07T20:32:58.1331390Z D=7168, 2025-05-07T20:32:58.1331580Z scale_ub=1200.0, 2025-05-07T20:32:58.1331804Z contiguous=True, 2025-05-07T20:32:58.1332028Z compiled=False, 2025-05-07T20:32:58.1332230Z ) 2025-05-07T20:32:58.2285453Z self = 2025-05-07T20:32:58.2285995Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.2286404Z 2025-05-07T20:32:58.2286496Z @given( 2025-05-07T20:32:58.2286726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2287044Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2287362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2287728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2288405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2288976Z ) 2025-05-07T20:32:58.2289675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2290824Z def test_silu_mul_quant( 2025-05-07T20:32:58.2291294Z self, 2025-05-07T20:32:58.2291675Z T: int, 2025-05-07T20:32:58.2292056Z D: int, 2025-05-07T20:32:58.2292488Z scale_ub: Optional[float], 2025-05-07T20:32:58.2293021Z contiguous: bool, 2025-05-07T20:32:58.2293487Z compiled: bool, 2025-05-07T20:32:58.2293936Z ) -> None: 2025-05-07T20:32:58.2294364Z torch.manual_seed(2025) 2025-05-07T20:32:58.2294820Z 2025-05-07T20:32:58.2295359Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2298709Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
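
Each "Trying example" block above is Hypothesis re-running the same test body with a fresh draw from the sampled_from strategies, which is why the source listing repeats. When debugging a single failing combination it can help to pin it with hypothesis.example so it always runs first; a small self-contained sketch of the pattern (the test below is a stand-in, not the FBGEMM test):

    from hypothesis import example, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @example(T=128)  # pinned case runs before any random draws
    @settings(max_examples=10, deadline=None)
    def test_demo(T: int) -> None:
        assert T >= 1

    test_demo()  # Hypothesis-wrapped tests can be called directly
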
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.2300735Z 2025-05-07T20:32:58.2300856Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.2301065Z 2025-05-07T20:32:58.2301175Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2301583Z self=, 2025-05-07T20:32:58.2301975Z T=1, 2025-05-07T20:32:58.2302161Z D=5120, 2025-05-07T20:32:58.2302345Z scale_ub=1200.0, 2025-05-07T20:32:58.2302566Z contiguous=True, 2025-05-07T20:32:58.2302787Z compiled=False, 2025-05-07T20:32:58.2302989Z ) 2025-05-07T20:32:58.2303304Z self = 2025-05-07T20:32:58.2303789Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.2304046Z 2025-05-07T20:32:58.2304127Z @given( 2025-05-07T20:32:58.2304351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2304660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2304963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2305350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2305680Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2305958Z ) 2025-05-07T20:32:58.2306298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2306731Z def test_silu_mul_quant( 2025-05-07T20:32:58.2306975Z self, 2025-05-07T20:32:58.2307164Z T: int, 2025-05-07T20:32:58.2307356Z D: int, 2025-05-07T20:32:58.2307572Z scale_ub: Optional[float], 2025-05-07T20:32:58.2307840Z contiguous: bool, 2025-05-07T20:32:58.2308077Z compiled: bool, 2025-05-07T20:32:58.2308293Z ) -> None: 2025-05-07T20:32:58.2308514Z torch.manual_seed(2025) 2025-05-07T20:32:58.2308746Z 2025-05-07T20:32:58.2309014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2309346Z 2025-05-07T20:32:58.2309533Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2309819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2310126Z x = x_sign * x_clamp 2025-05-07T20:32:58.2310433Z x0 = x[:, :D] 2025-05-07T20:32:58.2310646Z x1 = x[:, D:] 2025-05-07T20:32:58.2310854Z 2025-05-07T20:32:58.2311030Z if contiguous: 2025-05-07T20:32:58.2311264Z x0 = x0.contiguous() 2025-05-07T20:32:58.2311522Z x1 = x1.contiguous() 2025-05-07T20:32:58.2311753Z 2025-05-07T20:32:58.2311945Z if scale_ub is not None: 2025-05-07T20:32:58.2312217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2312551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2312847Z ) 2025-05-07T20:32:58.2313035Z else: 2025-05-07T20:32:58.2313244Z scale_ub_tensor = None 2025-05-07T20:32:58.2313485Z 2025-05-07T20:32:58.2313714Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2314021Z op = silu_mul_quant 2025-05-07T20:32:58.2314262Z if compiled: 2025-05-07T20:32:58.2314508Z op = torch.compile(op) 2025-05-07T20:32:58.2314805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2315067Z 2025-05-07T20:32:58.2315257Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2315418Z 2025-05-07T20:32:58.2315523Z moe/activation_test.py:117: 2025-05-07T20:32:58.2315809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2316133Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2316416Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2317196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2317876Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2318456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2319126Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2319784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2320308Z kernel = self.compile( 2025-05-07T20:32:58.2320842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2321494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2321876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2322106Z 2025-05-07T20:32:58.2322311Z self = 2025-05-07T20:32:58.2323393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2324830Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5e200>} 2025-05-07T20:32:58.2326151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2327158Z context = 2025-05-07T20:32:58.2327445Z 2025-05-07T20:32:58.2327615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2328129Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2328588Z module_map=module_map) 2025-05-07T20:32:58.2328952Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2329297Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2329557Z E ^ 2025-05-07T20:32:58.2330020Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2330521Z 2025-05-07T20:32:58.2330937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2331446Z 2025-05-07T20:32:58.2331550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2331958Z self=, 2025-05-07T20:32:58.2332351Z T=2048, 2025-05-07T20:32:58.2332531Z D=5120, 2025-05-07T20:32:58.2332727Z scale_ub=None, 2025-05-07T20:32:58.2332942Z contiguous=True, 2025-05-07T20:32:58.2333160Z compiled=False, 2025-05-07T20:32:58.2333364Z ) 2025-05-07T20:32:58.2333679Z self = 2025-05-07T20:32:58.2334166Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.2334431Z 2025-05-07T20:32:58.2334508Z @given( 2025-05-07T20:32:58.2334740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2335051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2335347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2335667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2335995Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2336268Z ) 2025-05-07T20:32:58.2336614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2337098Z def test_silu_mul_quant( 2025-05-07T20:32:58.2337330Z self, 2025-05-07T20:32:58.2337564Z T: int, 2025-05-07T20:32:58.2337764Z D: int, 2025-05-07T20:32:58.2337977Z scale_ub: Optional[float], 2025-05-07T20:32:58.2338242Z contiguous: bool, 2025-05-07T20:32:58.2338482Z compiled: bool, 2025-05-07T20:32:58.2338698Z ) -> None: 2025-05-07T20:32:58.2338914Z torch.manual_seed(2025) 2025-05-07T20:32:58.2339155Z 2025-05-07T20:32:58.2339420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2339878Z 2025-05-07T20:32:58.2340077Z > x_sign = torch.sign(x) 2025-05-07T20:32:58.2342000Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.2343860Z 2025-05-07T20:32:58.2343984Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:58.2344192Z 2025-05-07T20:32:58.2344405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2344822Z self=, 2025-05-07T20:32:58.2345224Z T=16384, 2025-05-07T20:32:58.2345409Z D=5120, 2025-05-07T20:32:58.2345608Z scale_ub=None, 2025-05-07T20:32:58.2345825Z contiguous=True, 2025-05-07T20:32:58.2346046Z compiled=False, 2025-05-07T20:32:58.2346244Z ) 2025-05-07T20:32:58.3292055Z self = 2025-05-07T20:32:58.3292565Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.3292857Z 2025-05-07T20:32:58.3292945Z @given( 2025-05-07T20:32:58.3293172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3293485Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3293791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3294120Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3294453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3294833Z ) 2025-05-07T20:32:58.3295169Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3295608Z def test_silu_mul_quant( 2025-05-07T20:32:58.3295844Z self, 2025-05-07T20:32:58.3296034Z T: int, 2025-05-07T20:32:58.3296224Z D: int, 2025-05-07T20:32:58.3296444Z scale_ub: Optional[float], 2025-05-07T20:32:58.3296712Z contiguous: bool, 2025-05-07T20:32:58.3296942Z compiled: bool, 2025-05-07T20:32:58.3297160Z ) -> None: 2025-05-07T20:32:58.3297370Z torch.manual_seed(2025) 2025-05-07T20:32:58.3297602Z 2025-05-07T20:32:58.3297870Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3299966Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
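
The allocator hint repeated in these messages, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if it is in the environment before the process makes its first CUDA allocation; toggling it inside an already-failing test is too late. A minimal sketch of applying it from Python, assuming the test process (rather than the workflow's job-level environment) is the place to set it:

    import os

    # Must be set before the first CUDA allocation in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var as a safe convention
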
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3301798Z 2025-05-07T20:32:58.3301920Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3302132Z 2025-05-07T20:32:58.3302319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3302775Z self=, 2025-05-07T20:32:58.3303172Z T=4096, 2025-05-07T20:32:58.3303354Z D=5120, 2025-05-07T20:32:58.3303538Z scale_ub=None, 2025-05-07T20:32:58.3303749Z contiguous=True, 2025-05-07T20:32:58.3303966Z compiled=False, 2025-05-07T20:32:58.3304164Z ) 2025-05-07T20:32:58.3304477Z self = 2025-05-07T20:32:58.3304961Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.3305228Z 2025-05-07T20:32:58.3305301Z @given( 2025-05-07T20:32:58.3305526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3305829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3306131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3306449Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3306774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3307049Z ) 2025-05-07T20:32:58.3307392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3307830Z def test_silu_mul_quant( 2025-05-07T20:32:58.3308112Z self, 2025-05-07T20:32:58.3308312Z T: int, 2025-05-07T20:32:58.3308514Z D: int, 2025-05-07T20:32:58.3308745Z scale_ub: Optional[float], 2025-05-07T20:32:58.3309107Z contiguous: bool, 2025-05-07T20:32:58.3309368Z compiled: bool, 2025-05-07T20:32:58.3309603Z ) -> None: 2025-05-07T20:32:58.3309832Z torch.manual_seed(2025) 2025-05-07T20:32:58.3310084Z 2025-05-07T20:32:58.3310373Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3312954Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3315327Z 2025-05-07T20:32:58.3315455Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3315693Z 2025-05-07T20:32:58.3315854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3316315Z self=, 2025-05-07T20:32:58.3316773Z T=2048, 2025-05-07T20:32:58.3316964Z D=5120, 2025-05-07T20:32:58.3317158Z scale_ub=None, 2025-05-07T20:32:58.3317381Z contiguous=False, 2025-05-07T20:32:58.3317621Z compiled=False, 2025-05-07T20:32:58.3317834Z ) 2025-05-07T20:32:58.3318182Z self = 2025-05-07T20:32:58.3318750Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.3319067Z 2025-05-07T20:32:58.3319144Z @given( 2025-05-07T20:32:58.3319379Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3319720Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3320056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3320423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3320798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3321114Z ) 2025-05-07T20:32:58.3321504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3322025Z def test_silu_mul_quant( 2025-05-07T20:32:58.3322284Z self, 2025-05-07T20:32:58.3322482Z T: int, 2025-05-07T20:32:58.3322686Z D: int, 2025-05-07T20:32:58.3322918Z scale_ub: Optional[float], 2025-05-07T20:32:58.3323261Z contiguous: bool, 2025-05-07T20:32:58.3323517Z compiled: bool, 2025-05-07T20:32:58.3323752Z ) -> None: 2025-05-07T20:32:58.3324014Z torch.manual_seed(2025) 2025-05-07T20:32:58.3324272Z 2025-05-07T20:32:58.3324567Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3327153Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3329576Z 2025-05-07T20:32:58.3329705Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3329947Z 2025-05-07T20:32:58.3330055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3330526Z self=, 2025-05-07T20:32:58.3330992Z T=4096, 2025-05-07T20:32:58.3331186Z D=7168, 2025-05-07T20:32:58.3331379Z scale_ub=None, 2025-05-07T20:32:58.3331604Z contiguous=True, 2025-05-07T20:32:58.3331839Z compiled=True, 2025-05-07T20:32:58.3332087Z ) 2025-05-07T20:32:58.3332444Z self = 2025-05-07T20:32:58.3333011Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.3333320Z 2025-05-07T20:32:58.3333398Z @given( 2025-05-07T20:32:58.3333637Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3333980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3334312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3334679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3335041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3335358Z ) 2025-05-07T20:32:58.3335748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3336256Z def test_silu_mul_quant( 2025-05-07T20:32:58.3336516Z self, 2025-05-07T20:32:58.3336713Z T: int, 2025-05-07T20:32:58.3336916Z D: int, 2025-05-07T20:32:58.3337148Z scale_ub: Optional[float], 2025-05-07T20:32:58.3337489Z contiguous: bool, 2025-05-07T20:32:58.3337744Z compiled: bool, 2025-05-07T20:32:58.3337975Z ) -> None: 2025-05-07T20:32:58.3338196Z torch.manual_seed(2025) 2025-05-07T20:32:58.3338452Z 2025-05-07T20:32:58.3338744Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3341395Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3343778Z 2025-05-07T20:32:58.3343906Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3344148Z 2025-05-07T20:32:58.3344258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3344723Z self=, 2025-05-07T20:32:58.3345178Z T=2048, 2025-05-07T20:32:58.3345366Z D=5120, 2025-05-07T20:32:58.3345564Z scale_ub=1200.0, 2025-05-07T20:32:58.3345803Z contiguous=False, 2025-05-07T20:32:58.3346038Z compiled=False, 2025-05-07T20:32:58.3346308Z ) 2025-05-07T20:32:58.3346659Z self = 2025-05-07T20:32:58.3347290Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.3347569Z 2025-05-07T20:32:58.3347644Z @given( 2025-05-07T20:32:58.3347868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3348177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3348475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3348804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3349126Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3349396Z ) 2025-05-07T20:32:58.3349745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3350181Z def test_silu_mul_quant( 2025-05-07T20:32:58.3350411Z self, 2025-05-07T20:32:58.3350603Z T: int, 2025-05-07T20:32:58.3350803Z D: int, 2025-05-07T20:32:58.3356178Z scale_ub: Optional[float], 2025-05-07T20:32:58.3356471Z contiguous: bool, 2025-05-07T20:32:58.3356721Z compiled: bool, 2025-05-07T20:32:58.3356956Z ) -> None: 2025-05-07T20:32:58.3357173Z torch.manual_seed(2025) 2025-05-07T20:32:58.3357427Z 2025-05-07T20:32:58.3357732Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3359877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
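
Across this run the failures shift to smaller and smaller requests (448.00 MiB down to 40.00 MiB) while the "allocated by PyTorch" figure stays near 21.7 GiB, which is consistent with buffers from earlier Hypothesis examples never being released within the single test process. A sketch of a per-example cleanup, assuming that dropping Python references plus emptying the cache is sufficient here; the function name is illustrative:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        # Drop dangling references, then return cached blocks to the driver
        # so the next example starts from a mostly empty pool.
        gc.collect()
        torch.cuda.empty_cache()
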
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.3361762Z 2025-05-07T20:32:58.3361889Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.3362109Z 2025-05-07T20:32:58.3362219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3362641Z self=, 2025-05-07T20:32:58.3363047Z T=4096, 2025-05-07T20:32:58.3363236Z D=7168, 2025-05-07T20:32:58.3363434Z scale_ub=1200.0, 2025-05-07T20:32:58.3363670Z contiguous=True, 2025-05-07T20:32:58.3363896Z compiled=False, 2025-05-07T20:32:58.3364164Z ) 2025-05-07T20:32:58.4621271Z self = 2025-05-07T20:32:58.4621822Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.4622102Z 2025-05-07T20:32:58.4622184Z @given( 2025-05-07T20:32:58.4622425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4622735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4623050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4623386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4623721Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4624002Z ) 2025-05-07T20:32:58.4624352Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4624800Z def test_silu_mul_quant( 2025-05-07T20:32:58.4625043Z self, 2025-05-07T20:32:58.4625248Z T: int, 2025-05-07T20:32:58.4625453Z D: int, 2025-05-07T20:32:58.4625677Z scale_ub: Optional[float], 2025-05-07T20:32:58.4625946Z contiguous: bool, 2025-05-07T20:32:58.4626190Z compiled: bool, 2025-05-07T20:32:58.4626418Z ) -> None: 2025-05-07T20:32:58.4626632Z torch.manual_seed(2025) 2025-05-07T20:32:58.4626878Z 2025-05-07T20:32:58.4627151Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4629276Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4631192Z 2025-05-07T20:32:58.4631312Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4631530Z 2025-05-07T20:32:58.4631634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4632050Z self=, 2025-05-07T20:32:58.4632455Z T=16384, 2025-05-07T20:32:58.4632648Z D=7168, 2025-05-07T20:32:58.4632846Z scale_ub=None, 2025-05-07T20:32:58.4633070Z contiguous=False, 2025-05-07T20:32:58.4633299Z compiled=True, 2025-05-07T20:32:58.4633509Z ) 2025-05-07T20:32:58.4633832Z self = 2025-05-07T20:32:58.4634320Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.4634598Z 2025-05-07T20:32:58.4634674Z @given( 2025-05-07T20:32:58.4634906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4635271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4635579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4635905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4636244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4636514Z ) 2025-05-07T20:32:58.4636858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4637293Z def test_silu_mul_quant( 2025-05-07T20:32:58.4637530Z self, 2025-05-07T20:32:58.4637724Z T: int, 2025-05-07T20:32:58.4637925Z D: int, 2025-05-07T20:32:58.4638140Z scale_ub: Optional[float], 2025-05-07T20:32:58.4638414Z contiguous: bool, 2025-05-07T20:32:58.4638660Z compiled: bool, 2025-05-07T20:32:58.4638876Z ) -> None: 2025-05-07T20:32:58.4639091Z torch.manual_seed(2025) 2025-05-07T20:32:58.4639329Z 2025-05-07T20:32:58.4639592Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4641614Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
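[Editor's note] Note the pattern across these examples: only ~26 MiB of the 22.07 GiB device is free while 21.73 GiB is already held by PyTorch, so tensors from earlier Hypothesis examples appear to survive into later ones. A minimal cleanup sketch one could run between examples (editor's suggestion, not part of the test):

```python
import gc
import torch

def free_cuda_between_examples() -> None:
    # Drop dangling Python references first; empty_cache() can only return
    # blocks whose tensors have already been freed back to the driver.
    gc.collect()
    torch.cuda.empty_cache()
```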
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4643529Z 2025-05-07T20:32:58.4643649Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4643860Z 2025-05-07T20:32:58.4643965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4644371Z self=, 2025-05-07T20:32:58.4644759Z T=4096, 2025-05-07T20:32:58.4644946Z D=7168, 2025-05-07T20:32:58.4645142Z scale_ub=None, 2025-05-07T20:32:58.4645352Z contiguous=True, 2025-05-07T20:32:58.4645573Z compiled=False, 2025-05-07T20:32:58.4645773Z ) 2025-05-07T20:32:58.4646081Z self = 2025-05-07T20:32:58.4646567Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.4646832Z 2025-05-07T20:32:58.4646905Z @given( 2025-05-07T20:32:58.4647129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4647478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4647781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4648170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4648522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4648804Z ) 2025-05-07T20:32:58.4649156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4649595Z def test_silu_mul_quant( 2025-05-07T20:32:58.4649825Z self, 2025-05-07T20:32:58.4650021Z T: int, 2025-05-07T20:32:58.4650217Z D: int, 2025-05-07T20:32:58.4650433Z scale_ub: Optional[float], 2025-05-07T20:32:58.4650699Z contiguous: bool, 2025-05-07T20:32:58.4650934Z compiled: bool, 2025-05-07T20:32:58.4651148Z ) -> None: 2025-05-07T20:32:58.4651362Z torch.manual_seed(2025) 2025-05-07T20:32:58.4651592Z 2025-05-07T20:32:58.4651856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4653921Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4655764Z 2025-05-07T20:32:58.4655880Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4656093Z 2025-05-07T20:32:58.4656197Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4656602Z self=, 2025-05-07T20:32:58.4656992Z T=16384, 2025-05-07T20:32:58.4657182Z D=7168, 2025-05-07T20:32:58.4657372Z scale_ub=None, 2025-05-07T20:32:58.4657581Z contiguous=True, 2025-05-07T20:32:58.4657801Z compiled=False, 2025-05-07T20:32:58.4658003Z ) 2025-05-07T20:32:58.4658308Z self = 2025-05-07T20:32:58.4658813Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.4659093Z 2025-05-07T20:32:58.4659167Z @given( 2025-05-07T20:32:58.4659398Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4659835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4660140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4660462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4660788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4661061Z ) 2025-05-07T20:32:58.4661406Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4661846Z def test_silu_mul_quant( 2025-05-07T20:32:58.4662077Z self, 2025-05-07T20:32:58.4662270Z T: int, 2025-05-07T20:32:58.4662467Z D: int, 2025-05-07T20:32:58.4662683Z scale_ub: Optional[float], 2025-05-07T20:32:58.4662949Z contiguous: bool, 2025-05-07T20:32:58.4663185Z compiled: bool, 2025-05-07T20:32:58.4663402Z ) -> None: 2025-05-07T20:32:58.4663614Z torch.manual_seed(2025) 2025-05-07T20:32:58.4663855Z 2025-05-07T20:32:58.4664120Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4666192Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
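[Editor's note] The 448 MiB requests are the largest case in the grid (T=16384, D=7168). Rather than letting such examples OOM, a guard could check free device memory first; `torch.cuda.mem_get_info()` returns `(free, total)` in bytes. A sketch:

```python
import torch

def has_free_cuda_mem(required_bytes: int) -> bool:
    # The runs above show ~26 MiB free against a 448 MiB request; a guard
    # like this would let the test skip such examples instead of failing.
    free, _total = torch.cuda.mem_get_info()
    return free >= required_bytes
```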
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4668063Z 2025-05-07T20:32:58.4668180Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4668395Z 2025-05-07T20:32:58.4668498Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4668900Z self=, 2025-05-07T20:32:58.4669289Z T=16384, 2025-05-07T20:32:58.4669482Z D=7168, 2025-05-07T20:32:58.4669669Z scale_ub=1200.0, 2025-05-07T20:32:58.4669883Z contiguous=True, 2025-05-07T20:32:58.4670102Z compiled=False, 2025-05-07T20:32:58.4670304Z ) 2025-05-07T20:32:58.4670610Z self = 2025-05-07T20:32:58.4671096Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.4671378Z 2025-05-07T20:32:58.4671457Z @given( 2025-05-07T20:32:58.4671684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.4671988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.4672296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.4672619Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.4672939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.4673216Z ) 2025-05-07T20:32:58.4673604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.4674041Z def test_silu_mul_quant( 2025-05-07T20:32:58.4674275Z self, 2025-05-07T20:32:58.4674465Z T: int, 2025-05-07T20:32:58.4674657Z D: int, 2025-05-07T20:32:58.4674869Z scale_ub: Optional[float], 2025-05-07T20:32:58.4675134Z contiguous: bool, 2025-05-07T20:32:58.4675371Z compiled: bool, 2025-05-07T20:32:58.4675585Z ) -> None: 2025-05-07T20:32:58.4675797Z torch.manual_seed(2025) 2025-05-07T20:32:58.4676036Z 2025-05-07T20:32:58.4676300Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.4678319Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
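[Editor's note] For reference, the `@given` block that keeps reappearing above samples from fixed lists, so the whole search space is a small grid; this is why `verbosity=Verbosity.verbose` prints one `Trying example:` block per drawn combination. Counting the grid:

```python
from itertools import product

Ts = [1, 128, 2048, 4096, 16384]
Ds = [5120, 7168]
scale_ubs = [None, 1200.00]
bools = [True, False]

cases = list(product(Ts, Ds, scale_ubs, bools, bools))
print(len(cases))  # 80 distinct (T, D, scale_ub, contiguous, compiled) tuples
```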
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.4680240Z 2025-05-07T20:32:58.4680359Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.4680565Z 2025-05-07T20:32:58.4680677Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.4681082Z self=, 2025-05-07T20:32:58.4681479Z T=128, 2025-05-07T20:32:58.4681662Z D=5120, 2025-05-07T20:32:58.4681855Z scale_ub=1200.0, 2025-05-07T20:32:58.4682076Z contiguous=False, 2025-05-07T20:32:58.4682296Z compiled=False, 2025-05-07T20:32:58.4682494Z ) 2025-05-07T20:32:58.6093583Z self = 2025-05-07T20:32:58.6094637Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.6095196Z 2025-05-07T20:32:58.6095349Z @given( 2025-05-07T20:32:58.6095812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6096428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6097036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6097687Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6098296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6098622Z ) 2025-05-07T20:32:58.6098977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6099557Z def test_silu_mul_quant( 2025-05-07T20:32:58.6099983Z self, 2025-05-07T20:32:58.6100185Z T: int, 2025-05-07T20:32:58.6100388Z D: int, 2025-05-07T20:32:58.6100612Z scale_ub: Optional[float], 2025-05-07T20:32:58.6100888Z contiguous: bool, 2025-05-07T20:32:58.6101130Z compiled: bool, 2025-05-07T20:32:58.6101354Z ) -> None: 2025-05-07T20:32:58.6101571Z torch.manual_seed(2025) 2025-05-07T20:32:58.6101813Z 2025-05-07T20:32:58.6102089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6102421Z 2025-05-07T20:32:58.6102617Z x_sign = torch.sign(x) 2025-05-07T20:32:58.6102911Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.6103214Z x = x_sign * x_clamp 2025-05-07T20:32:58.6103460Z x0 = x[:, :D] 2025-05-07T20:32:58.6103678Z x1 = x[:, D:] 2025-05-07T20:32:58.6103887Z 2025-05-07T20:32:58.6104076Z if contiguous: 2025-05-07T20:32:58.6104313Z x0 = x0.contiguous() 2025-05-07T20:32:58.6104568Z x1 = x1.contiguous() 2025-05-07T20:32:58.6104809Z 2025-05-07T20:32:58.6105008Z if scale_ub is not None: 2025-05-07T20:32:58.6105275Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.6105611Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.6105987Z ) 2025-05-07T20:32:58.6106182Z else: 2025-05-07T20:32:58.6106394Z scale_ub_tensor = None 2025-05-07T20:32:58.6106652Z 2025-05-07T20:32:58.6106885Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.6107193Z op = silu_mul_quant 2025-05-07T20:32:58.6107445Z if compiled: 2025-05-07T20:32:58.6107696Z op = torch.compile(op) 2025-05-07T20:32:58.6107990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.6108267Z 2025-05-07T20:32:58.6108466Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.6108631Z 2025-05-07T20:32:58.6108736Z moe/activation_test.py:117: 2025-05-07T20:32:58.6109038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.6109375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.6109657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.6110354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.6111121Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.6111657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.6112334Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.6112996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.6113529Z kernel = self.compile( 2025-05-07T20:32:58.6114073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.6114721Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.6115126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.6115353Z 2025-05-07T20:32:58.6115573Z self = 2025-05-07T20:32:58.6116646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.6118016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3133089ea0>} 2025-05-07T20:32:58.6119427Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.6120441Z context = 2025-05-07T20:32:58.6120728Z 2025-05-07T20:32:58.6120896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.6121411Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.6121881Z module_map=module_map) 2025-05-07T20:32:58.6122249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.6122600Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.6122855Z E ^ 2025-05-07T20:32:58.6123319Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.6123766Z 2025-05-07T20:32:58.6124186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.6124701Z 2025-05-07T20:32:58.6124808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6125221Z self=, 2025-05-07T20:32:58.6125616Z T=2048, 2025-05-07T20:32:58.6125806Z D=7168, 2025-05-07T20:32:58.6125997Z scale_ub=None, 2025-05-07T20:32:58.6126257Z contiguous=False, 2025-05-07T20:32:58.6126495Z compiled=False, 2025-05-07T20:32:58.6126699Z ) 2025-05-07T20:32:58.6127021Z self = 2025-05-07T20:32:58.6127513Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.6127782Z 2025-05-07T20:32:58.6127857Z @given( 2025-05-07T20:32:58.6128089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6128403Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6128709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6129045Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6129374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6129660Z ) 2025-05-07T20:32:58.6130004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6130441Z def test_silu_mul_quant( 2025-05-07T20:32:58.6130681Z self, 2025-05-07T20:32:58.6130926Z T: int, 2025-05-07T20:32:58.6131126Z D: int, 2025-05-07T20:32:58.6131351Z scale_ub: Optional[float], 2025-05-07T20:32:58.6131616Z contiguous: bool, 2025-05-07T20:32:58.6131858Z compiled: bool, 2025-05-07T20:32:58.6132083Z ) -> None: 2025-05-07T20:32:58.6132296Z torch.manual_seed(2025) 2025-05-07T20:32:58.6132539Z 2025-05-07T20:32:58.6132807Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6134841Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
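[Editor's note] The `CompilationError` just above is a different failure class from the OOMs: Triton refuses to emit `fp8e4nv` (e4m3) code on this GPU. The runner is a g5.4xlarge, i.e. an A10G at compute capability sm_86, while Triton's fp8e4nv path requires sm_89 (Ada) or sm_90+ (Hopper). A hedged sketch of a skip guard (names are the editor's, not FBGEMM's):

```python
import unittest
import torch

def _supports_fp8_e4m3() -> bool:
    # fp8e4nv is Triton's name for e4m3; codegen for it needs sm_89+.
    # The A10G backing this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

skip_unless_fp8 = unittest.skipUnless(_supports_fp8_e4m3(), "requires sm_89+ for fp8e4nv")
```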
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.6136679Z 2025-05-07T20:32:58.6136803Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.6137012Z 2025-05-07T20:32:58.6137117Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6137528Z self=, 2025-05-07T20:32:58.6137944Z T=128, 2025-05-07T20:32:58.6138158Z D=7168, 2025-05-07T20:32:58.6138357Z scale_ub=1200.0, 2025-05-07T20:32:58.6138631Z contiguous=True, 2025-05-07T20:32:58.6138849Z compiled=True, 2025-05-07T20:32:58.6139052Z ) 2025-05-07T20:32:58.6557713Z self = 2025-05-07T20:32:58.6558286Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.6558555Z 2025-05-07T20:32:58.6558633Z @given( 2025-05-07T20:32:58.6558866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6559169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6559479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6559807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6560133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6560418Z ) 2025-05-07T20:32:58.6560766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6561208Z def test_silu_mul_quant( 2025-05-07T20:32:58.6561456Z self, 2025-05-07T20:32:58.6561652Z T: int, 2025-05-07T20:32:58.6561844Z D: int, 2025-05-07T20:32:58.6562070Z scale_ub: Optional[float], 2025-05-07T20:32:58.6562349Z contiguous: bool, 2025-05-07T20:32:58.6562585Z compiled: bool, 2025-05-07T20:32:58.6562812Z ) -> None: 2025-05-07T20:32:58.6563034Z torch.manual_seed(2025) 2025-05-07T20:32:58.6563273Z 2025-05-07T20:32:58.6563611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6563954Z 2025-05-07T20:32:58.6564151Z x_sign = torch.sign(x) 2025-05-07T20:32:58.6564441Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.6564749Z x = x_sign * x_clamp 2025-05-07T20:32:58.6564995Z x0 = x[:, :D] 2025-05-07T20:32:58.6565208Z x1 = x[:, D:] 2025-05-07T20:32:58.6565418Z 2025-05-07T20:32:58.6565606Z if contiguous: 2025-05-07T20:32:58.6565837Z x0 = x0.contiguous() 2025-05-07T20:32:58.6566098Z x1 = x1.contiguous() 2025-05-07T20:32:58.6566341Z 2025-05-07T20:32:58.6566536Z if scale_ub is not None: 2025-05-07T20:32:58.6566810Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.6567145Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.6567445Z ) 2025-05-07T20:32:58.6567643Z else: 2025-05-07T20:32:58.6567856Z scale_ub_tensor = None 2025-05-07T20:32:58.6568106Z 2025-05-07T20:32:58.6568336Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.6568730Z op = silu_mul_quant 2025-05-07T20:32:58.6568982Z if compiled: 2025-05-07T20:32:58.6569230Z op = torch.compile(op) 2025-05-07T20:32:58.6569524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.6569794Z 2025-05-07T20:32:58.6569985Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.6570151Z 2025-05-07T20:32:58.6570250Z moe/activation_test.py:117: 2025-05-07T20:32:58.6570546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.6570874Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.6571154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.6571713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.6572269Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.6572926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.6573615Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.6574147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.6574818Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.6575475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.6576083Z kernel = self.compile( 2025-05-07T20:32:58.6576664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.6577310Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.6577706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.6577934Z 2025-05-07T20:32:58.6578149Z self = 2025-05-07T20:32:58.6579221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.6580682Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313308b7f0>} 2025-05-07T20:32:58.6582022Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.6583041Z context = 2025-05-07T20:32:58.6583323Z 2025-05-07T20:32:58.6583544Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.6584065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.6584529Z module_map=module_map) 2025-05-07T20:32:58.6584897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.6585249Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.6585505Z E ^ 2025-05-07T20:32:58.6585970Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.6586423Z 2025-05-07T20:32:58.6586846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.6587354Z 2025-05-07T20:32:58.6587464Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6587873Z self=, 2025-05-07T20:32:58.6594722Z T=128, 2025-05-07T20:32:58.6594941Z D=7168, 2025-05-07T20:32:58.6595144Z scale_ub=1200.0, 2025-05-07T20:32:58.6595501Z contiguous=True, 2025-05-07T20:32:58.6595737Z compiled=False, 2025-05-07T20:32:58.6595959Z ) 2025-05-07T20:32:58.6596286Z self = 2025-05-07T20:32:58.6596791Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.6597068Z 2025-05-07T20:32:58.6597158Z @given( 2025-05-07T20:32:58.6597397Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6597727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6598051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6598389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6598724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6599016Z ) 2025-05-07T20:32:58.6599376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6599820Z def test_silu_mul_quant( 2025-05-07T20:32:58.6600072Z self, 2025-05-07T20:32:58.6600281Z T: int, 2025-05-07T20:32:58.6600480Z D: int, 2025-05-07T20:32:58.6600709Z scale_ub: Optional[float], 2025-05-07T20:32:58.6600987Z contiguous: bool, 2025-05-07T20:32:58.6601225Z compiled: bool, 2025-05-07T20:32:58.6601457Z ) -> None: 2025-05-07T20:32:58.6601678Z torch.manual_seed(2025) 2025-05-07T20:32:58.6601920Z 2025-05-07T20:32:58.6602200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6602626Z 2025-05-07T20:32:58.6602825Z x_sign = torch.sign(x) 2025-05-07T20:32:58.6603195Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.6605198Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
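[Editor's note] Here the OOM has moved from `torch.randn` to the `torch.clamp` line: the input itself fit, but each of `sign`, `abs`, and `clamp` materializes another full `[T, 2*D]` temporary on top of it. A slightly leaner equivalent (editor's sketch; same values, one fewer temporary):

```python
import torch

def make_input(T: int, D: int) -> torch.Tensor:
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    # clamp_ reuses the buffer produced by abs(), saving one temporary
    # relative to the test's sign/abs/clamp sequence.
    return torch.sign(x) * torch.abs(x).clamp_(0.01, 2.0)
```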
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.6607048Z 2025-05-07T20:32:58.6607176Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:58.6607395Z 2025-05-07T20:32:58.6607508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6607923Z self=, 2025-05-07T20:32:58.6608364Z T=128, 2025-05-07T20:32:58.6608585Z D=5120, 2025-05-07T20:32:58.6608778Z scale_ub=1200.0, 2025-05-07T20:32:58.6609007Z contiguous=True, 2025-05-07T20:32:58.6609244Z compiled=True, 2025-05-07T20:32:58.6609448Z ) 2025-05-07T20:32:58.6609835Z self = 2025-05-07T20:32:58.6610323Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.6610593Z 2025-05-07T20:32:58.6610683Z @given( 2025-05-07T20:32:58.6610913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.6611231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.6611546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.6611874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.6612214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.6612504Z ) 2025-05-07T20:32:58.6612855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.6613300Z def test_silu_mul_quant( 2025-05-07T20:32:58.6613545Z self, 2025-05-07T20:32:58.6613747Z T: int, 2025-05-07T20:32:58.6613945Z D: int, 2025-05-07T20:32:58.6614169Z scale_ub: Optional[float], 2025-05-07T20:32:58.6614447Z contiguous: bool, 2025-05-07T20:32:58.6614691Z compiled: bool, 2025-05-07T20:32:58.6614980Z ) -> None: 2025-05-07T20:32:58.6615202Z torch.manual_seed(2025) 2025-05-07T20:32:58.6615441Z 2025-05-07T20:32:58.6615733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.6616074Z 2025-05-07T20:32:58.6616270Z x_sign = torch.sign(x) 2025-05-07T20:32:58.6616564Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.6618547Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.6620466Z 2025-05-07T20:32:58.6620590Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:58.6620802Z 2025-05-07T20:32:58.6620912Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.6621322Z self=, 2025-05-07T20:32:58.6621732Z T=128, 2025-05-07T20:32:58.6621924Z D=7168, 2025-05-07T20:32:58.6622122Z scale_ub=None, 2025-05-07T20:32:58.6622335Z contiguous=True, 2025-05-07T20:32:58.6622614Z compiled=True, 2025-05-07T20:32:58.6622822Z ) 2025-05-07T20:32:58.8671756Z self = 2025-05-07T20:32:58.8672284Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.8672544Z 2025-05-07T20:32:58.8672625Z @given( 2025-05-07T20:32:58.8672858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8673164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8673480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8673809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8674130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8674419Z ) 2025-05-07T20:32:58.8674770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8675206Z def test_silu_mul_quant( 2025-05-07T20:32:58.8675447Z self, 2025-05-07T20:32:58.8675644Z T: int, 2025-05-07T20:32:58.8675847Z D: int, 2025-05-07T20:32:58.8676061Z scale_ub: Optional[float], 2025-05-07T20:32:58.8676337Z contiguous: bool, 2025-05-07T20:32:58.8676579Z compiled: bool, 2025-05-07T20:32:58.8676803Z ) -> None: 2025-05-07T20:32:58.8677020Z torch.manual_seed(2025) 2025-05-07T20:32:58.8677264Z 2025-05-07T20:32:58.8677598Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8679620Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
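[Editor's note] Failure 4 in the summary that follows goes through the test's `ref_fn` (shown in full a little further down), which dequantizes with `y_fp8.to(torch.float32) * y_scale[:, None]`; that pins down the contract of `triton_quantize_fp8_row`: per-row scales, recovered by multiplication. A plain-PyTorch sketch of that contract; the exact `scale_ub` handling (capping the per-row max) and the e4m3 limit of 448 are the editor's assumptions, not FBGEMM's code:

```python
import torch

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None):
    FP8_MAX = 448.0  # max finite value of float8_e4m3fn
    row_max = y.abs().amax(dim=1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed cap, see note above
    inv_scale = FP8_MAX / row_max.clamp(min=1e-12)
    y_fp8 = (y.to(torch.float32) * inv_scale).clamp(-FP8_MAX, FP8_MAX)
    # Dequantize as the test does: y_fp8.to(torch.float32) * y_scale[:, None]
    return y_fp8.to(torch.float8_e4m3fn), (1.0 / inv_scale).squeeze(1)
```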
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.8681458Z 2025-05-07T20:32:58.8681580Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.8681791Z 2025-05-07T20:32:58.8693361Z FAILED 2025-05-07T20:32:58.8693583Z 2025-05-07T20:32:58.8693849Z =================================== FAILURES =================================== 2025-05-07T20:32:58.8694490Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:58.8695113Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:58.8696195Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:58.8696954Z | yield 2025-05-07T20:32:58.8697539Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:58.8698493Z | self._callTestMethod(testMethod) 2025-05-07T20:32:58.8699270Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:58.8700108Z | method() 2025-05-07T20:32:58.8700995Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:58.8702192Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8703085Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:58.8703967Z | raise the_error_hypothesis_found 2025-05-07T20:32:58.8704657Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:58.8705327Z +-+---------------- 1 ---------------- 2025-05-07T20:32:58.8705725Z | Traceback (most recent call last): 2025-05-07T20:32:58.8706714Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:58.8707996Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8710868Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
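[Editor's note] The summary above shows why one test reports four failures at once: Hypothesis 6.x raises a single `exceptiongroup.ExceptionGroup` covering every distinct falsifying example (the backport package, since this job runs Python 3.10 and native exception groups arrived in 3.11). A small sketch of splitting such a group by failure type (`handle` is the editor's name):

```python
import torch
from exceptiongroup import ExceptionGroup  # backport; built into Python 3.11+

def handle(eg: ExceptionGroup) -> None:
    # split() partitions the sub-exceptions into (matching, non_matching);
    # here that separates the three OOMs from the Triton CompilationError.
    oom, rest = eg.split(torch.OutOfMemoryError)
    ...
```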
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.8713614Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:58.8714224Z | self=, 2025-05-07T20:32:58.8714778Z | T=2048, 2025-05-07T20:32:58.8715092Z | D=5120, # or any other generated value 2025-05-07T20:32:58.8715562Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:58.8716059Z | contiguous=True, # or any other generated value 2025-05-07T20:32:58.8716598Z | compiled=False, # or any other generated value 2025-05-07T20:32:58.8717014Z | ) 2025-05-07T20:32:58.8717265Z | 2025-05-07T20:32:58.8718127Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:58.8718960Z +---------------- 2 ---------------- 2025-05-07T20:32:58.8719377Z | Traceback (most recent call last): 2025-05-07T20:32:58.8720390Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:58.8721454Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8724270Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.8727041Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:58.8727650Z | self=, 2025-05-07T20:32:58.8728222Z | T=128, 2025-05-07T20:32:58.8728504Z | D=7168, 2025-05-07T20:32:58.8728779Z | scale_ub=None, 2025-05-07T20:32:58.8729107Z | contiguous=True, 2025-05-07T20:32:58.8729475Z | compiled=True, 2025-05-07T20:32:58.8729790Z | ) 2025-05-07T20:32:58.8730035Z | 2025-05-07T20:32:58.8730766Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:58.8731603Z +---------------- 3 ---------------- 2025-05-07T20:32:58.8732001Z | Traceback (most recent call last): 2025-05-07T20:32:58.8732850Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:58.8733615Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8735679Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
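[Editor's note] Each sub-exception ends with a ready-made replay recipe. To rerun failure 1 deterministically, stack `@reproduce_failure` on top of the existing `@given` (version string and payload copied verbatim from the report above); the strategies must stay identical for the blob to decode:

```python
from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_repro(T, D, scale_ub, contiguous, compiled):
    ...  # body of test_silu_mul_quant, unchanged
```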
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.8737653Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:58.8738085Z | self=, 2025-05-07T20:32:58.8738661Z | T=128, 2025-05-07T20:32:58.8738944Z | D=5120, 2025-05-07T20:32:58.8739239Z | scale_ub=1200.0, 2025-05-07T20:32:58.8739585Z | contiguous=True, 2025-05-07T20:32:58.8740038Z | compiled=True, 2025-05-07T20:32:58.8741931Z | ) 2025-05-07T20:32:58.8742197Z | 2025-05-07T20:32:58.8742937Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:58.8743792Z +---------------- 4 ---------------- 2025-05-07T20:32:58.8744217Z | Traceback (most recent call last): 2025-05-07T20:32:58.8745260Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:58.8746314Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:58.8747294Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:58.8748028Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8749220Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:58.8750361Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.8751238Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:58.8752312Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8755158Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:58.8756266Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8757439Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:58.8758785Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8759888Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:58.8760840Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.8761759Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:58.8762554Z | fn() 2025-05-07T20:32:58.8763348Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:58.8764222Z | self.fn.run( 2025-05-07T20:32:58.8764954Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:58.8765756Z | kernel = self.compile( 2025-05-07T20:32:58.8766603Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:58.8767577Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8768561Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:58.8769731Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.8770548Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.8773987Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.8774382Z | ^ 2025-05-07T20:32:58.8775019Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.8775812Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:58.8776361Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:58.8777071Z | self=, 2025-05-07T20:32:58.8777658Z | T=1, # or any other generated value 2025-05-07T20:32:58.8778092Z | D=5120, # or any other generated value 2025-05-07T20:32:58.8778568Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:58.8779067Z | contiguous=True, # or any other generated value 2025-05-07T20:32:58.8779627Z | compiled=True, # or any other generated value 2025-05-07T20:32:58.8780207Z | ) 2025-05-07T20:32:58.8780461Z | 2025-05-07T20:32:58.8781178Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:58.8782016Z +------------------------------------ 2025-05-07T20:32:58.8782593Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:58.8783121Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.8783685Z self=, 2025-05-07T20:32:58.8784242Z T=1, 2025-05-07T20:32:58.8784493Z D=5120, 2025-05-07T20:32:58.8784759Z scale_ub=None, 2025-05-07T20:32:58.8785056Z contiguous=True, 2025-05-07T20:32:58.8785354Z compiled=True, 2025-05-07T20:32:58.8785644Z ) 2025-05-07T20:32:58.8786087Z self = 2025-05-07T20:32:58.8786761Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.8787122Z 2025-05-07T20:32:58.8787229Z @given( 2025-05-07T20:32:58.8787548Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8787981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8788447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8788919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8789460Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8790136Z ) 2025-05-07T20:32:58.8790636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8791235Z def test_silu_mul_quant( 2025-05-07T20:32:58.8791561Z self, 2025-05-07T20:32:58.8791818Z T: int, 2025-05-07T20:32:58.8792091Z D: int, 2025-05-07T20:32:58.8792400Z scale_ub: Optional[float], 2025-05-07T20:32:58.8792768Z contiguous: bool, 2025-05-07T20:32:58.8793109Z compiled: bool, 2025-05-07T20:32:58.8793429Z ) -> None: 2025-05-07T20:32:58.8793724Z torch.manual_seed(2025) 2025-05-07T20:32:58.8794067Z 2025-05-07T20:32:58.8794445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8794900Z 2025-05-07T20:32:58.8795160Z x_sign = torch.sign(x) 2025-05-07T20:32:58.8795558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.8795977Z x = x_sign * x_clamp 2025-05-07T20:32:58.8796318Z x0 = x[:, :D] 2025-05-07T20:32:58.8796619Z x1 = x[:, D:] 2025-05-07T20:32:58.8796903Z 2025-05-07T20:32:58.8797159Z if contiguous: 2025-05-07T20:32:58.8797481Z x0 = x0.contiguous() 
2025-05-07T20:32:58.8797835Z x1 = x1.contiguous() 2025-05-07T20:32:58.8798171Z 2025-05-07T20:32:58.8798438Z if scale_ub is not None: 2025-05-07T20:32:58.8799009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.8799526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.8799943Z ) 2025-05-07T20:32:58.8800205Z else: 2025-05-07T20:32:58.8800482Z scale_ub_tensor = None 2025-05-07T20:32:58.8800829Z 2025-05-07T20:32:58.8801141Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8801554Z op = silu_mul_quant 2025-05-07T20:32:58.8801878Z if compiled: 2025-05-07T20:32:58.8802201Z op = torch.compile(op) 2025-05-07T20:32:58.8802575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8802924Z 2025-05-07T20:32:58.8803177Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.8803541Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.8803911Z 2025-05-07T20:32:58.8804217Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8804653Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.8805020Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.8805425Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.8805879Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8806268Z 2025-05-07T20:32:58.8806522Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:58.8806774Z 2025-05-07T20:32:58.8807048Z moe/activation_test.py:126: 2025-05-07T20:32:58.8807431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8807861Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.8808280Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8809290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.8810274Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.8811031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.8811963Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8812905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.8813881Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8814901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:58.8816003Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8816991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.8817848Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.8818652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.8819327Z fn() 2025-05-07T20:32:58.8820087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.8820841Z self.fn.run( 2025-05-07T20:32:58.8821454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.8822141Z kernel = self.compile( 2025-05-07T20:32:58.8822838Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.8823678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8824182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8824475Z 2025-05-07T20:32:58.8824736Z self = 2025-05-07T20:32:58.8826271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.8828162Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32572a4af0>} 2025-05-07T20:32:58.8829946Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.8831291Z context = 2025-05-07T20:32:58.8831660Z 2025-05-07T20:32:58.8831877Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.8832553Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.8833175Z module_map=module_map) 2025-05-07T20:32:58.8833656Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.8834142Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.8834504Z E ^ 2025-05-07T20:32:58.8835173Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.8835803Z 2025-05-07T20:32:58.8836376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.8837089Z 2025-05-07T20:32:58.8837227Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.8837785Z self=, 2025-05-07T20:32:58.8838324Z T=2048, 2025-05-07T20:32:58.8838570Z D=5120, 2025-05-07T20:32:58.8838819Z scale_ub=1200.0, 2025-05-07T20:32:58.8839107Z contiguous=True, 2025-05-07T20:32:58.8839412Z compiled=False, 2025-05-07T20:32:58.8839686Z ) 2025-05-07T20:32:58.8840109Z self = 2025-05-07T20:32:58.8840762Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.8841138Z 2025-05-07T20:32:58.8841237Z @given( 2025-05-07T20:32:58.8841532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8841962Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8842453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8842928Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8843361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8843740Z ) 2025-05-07T20:32:58.8844203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8844807Z def test_silu_mul_quant( 2025-05-07T20:32:58.8845138Z self, 2025-05-07T20:32:58.8845399Z T: int, 2025-05-07T20:32:58.8865876Z D: int, 2025-05-07T20:32:58.8866174Z scale_ub: Optional[float], 2025-05-07T20:32:58.8866525Z contiguous: bool, 2025-05-07T20:32:58.8866831Z compiled: bool, 2025-05-07T20:32:58.8867122Z ) -> None: 2025-05-07T20:32:58.8867392Z torch.manual_seed(2025) 2025-05-07T20:32:58.8867706Z 2025-05-07T20:32:58.8868074Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8868538Z 2025-05-07T20:32:58.8868792Z x_sign = torch.sign(x) 2025-05-07T20:32:58.8869179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.8869592Z x = x_sign * x_clamp 2025-05-07T20:32:58.8869904Z x0 = x[:, :D] 
2025-05-07T20:32:58.8870194Z x1 = x[:, D:] 2025-05-07T20:32:58.8870470Z 2025-05-07T20:32:58.8870711Z if contiguous: 2025-05-07T20:32:58.8871018Z x0 = x0.contiguous() 2025-05-07T20:32:58.8871471Z x1 = x1.contiguous() 2025-05-07T20:32:58.8871782Z 2025-05-07T20:32:58.8872046Z if scale_ub is not None: 2025-05-07T20:32:58.8872476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.8872917Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.8873335Z ) 2025-05-07T20:32:58.8873597Z else: 2025-05-07T20:32:58.8873882Z scale_ub_tensor = None 2025-05-07T20:32:58.8874207Z 2025-05-07T20:32:58.8874522Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8874945Z op = silu_mul_quant 2025-05-07T20:32:58.8875273Z if compiled: 2025-05-07T20:32:58.8875602Z op = torch.compile(op) 2025-05-07T20:32:58.8876000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8876369Z 2025-05-07T20:32:58.8876623Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.8876839Z 2025-05-07T20:32:58.8876983Z moe/activation_test.py:117: 2025-05-07T20:32:58.8877380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8877835Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.8878219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8879178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.8880090Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.8880867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.8881792Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8882694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.8883427Z kernel = self.compile( 2025-05-07T20:32:58.8884175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.8885098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8885648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8885969Z 2025-05-07T20:32:58.8886240Z self = 2025-05-07T20:32:58.8887702Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.8889711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3257181990>} 2025-05-07T20:32:58.8891831Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.8893264Z context = 2025-05-07T20:32:58.8893647Z 2025-05-07T20:32:58.8893871Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.8894565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.8895194Z module_map=module_map) 2025-05-07T20:32:58.8895679Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.8896162Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.8896479Z E ^ 2025-05-07T20:32:58.8897112Z E ValueError("type fp8e4nv not supported in this architecture. 

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
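Every failure in this run has the same root cause: Triton's fp8e4nv type is FP8 E4M3 in the NVIDIA-native encoding, which Triton only lowers on GPUs of compute capability 8.9 or newer (Ada/Hopper); on older parts only 'fp8e4b15' and 'fp8e5' are available, exactly as the ValueError reports. A minimal sketch of a capability guard that would skip such cases instead of erroring; the helper and the test-class name here are illustrative, not part of moe/activation_test.py:

    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA GPUs with
        # compute capability >= 8.9 (Ada / Hopper and newer).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(
        not _supports_fp8e4nv(),
        "fp8e4nv requires an NVIDIA GPU with compute capability >= 8.9",
    )
    class Fp8ActivationTest(unittest.TestCase):
        # FP8 test bodies such as test_silu_mul_quant would live here and be
        # skipped, rather than fail, on unsupported architectures.
        pass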
Hypothesis went on to retry the test with further examples; every one failed with the identical CompilationError from triton/compiler/compiler.py:100, wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The failure hits one of two Triton kernels: _fbgemm_silu_mul_quant, reached from fn() at moe/activation_test.py:117 via fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, or _kernel_quantize_fp8_row, reached from ref_fn() at moe/activation_test.py:126 via triton_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370). The examples tried, and the kernel whose compilation failed:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> _kernel_quantize_fp8_row
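The reference path's row-wise quantization has simple semantics: scale each row so its largest magnitude maps to the FP8 E4M3 maximum, optionally capping the row maximum at scale_ub first, and return the FP8 payload together with the per-row dequantization scale (the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]). A pure-PyTorch sketch of that behavior; the exact clamping and epsilon details inside fbgemm's triton_quantize_fp8_row are assumptions here:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn


    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row maximum magnitude, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Per-row dequantization scale; guard against all-zero rows.
        y_scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (
            (y / y_scale[:, None])
            .clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
            .to(torch.float8_e4m3fn)
        )
        return y_fp8, y_scale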
The failure site tracks the compiled flag exactly: every compiled=False example fails inside fn() while compiling _fbgemm_silu_mul_quant, while every compiled=True example gets past fn() and fails inside the eager reference path ref_fn(), during autotuning of _kernel_quantize_fp8_row.
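The failure is also reproducible outside Hypothesis in a couple of lines; this sketch assumes the same environment as the run above (an fbgemm_gpu genai build and a CUDA device) and uses the two-argument call shape seen in the tracebacks:

    import torch

    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ..."),
    # as in the runs above.
    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)  # scale_ub=None, as in the test
    print(y_fp8.dtype, y_scale.shape)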
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    (test body identical to the listing above)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0,
2025-05-07T20:32:58.9155060Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:58.9155568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254f24280>}
2025-05-07T20:32:58.9156304Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:58.9156501Z context =
2025-05-07T20:32:58.9156507Z 
2025-05-07T20:32:58.9156671Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:58.9156937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:58.9169013Z module_map=module_map)
2025-05-07T20:32:58.9169226Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9169331Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9169503Z E ^
2025-05-07T20:32:58.9169909Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9169915Z 
2025-05-07T20:32:58.9170335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9170340Z 
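Both failures above, and every retry below, hit the same point: Triton rejects the fp8e4nv (FP8 E4M3) element type while compiling the quantization kernels, and the ValueError lists fp8e4b15 and fp8e5 as the only FP8 dtypes this architecture supports. That combination indicates a GPU older than compute capability 8.9 (Ada/Hopper), the first generations for which Triton can emit fp8e4nv. A capability check would turn these hard failures into skips; the sketch below is illustrative only (the helper and class names are assumptions, not FBGEMM code):

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton can only compile fp8e4nv (FP8 E4M3) for SM 8.9+ GPUs; older
        # parts expose just fp8e4b15/fp8e5, as the ValueError above reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8e4nv(), "requires SM 8.9+ for Triton fp8e4nv")
    class SiluMulQuantTest(unittest.TestCase):  # hypothetical placement
        ...

With such a guard in place, Hypothesis would not keep retrying every parameter combination against the same CompilationError, as it does in the remaining examples below.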
2025-05-07T20:32:58.9170457Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.9170687Z self=,
2025-05-07T20:32:58.9170771Z T=4096,
2025-05-07T20:32:58.9170861Z D=5120,
2025-05-07T20:32:58.9170946Z scale_ub=None,
2025-05-07T20:32:58.9171034Z contiguous=True,
2025-05-07T20:32:58.9171127Z compiled=True,
2025-05-07T20:32:58.9171209Z )
2025-05-07T20:32:58.9177452Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:58.9177557Z moe/activation_test.py:126:
2025-05-07T20:32:58.9188386Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9188548Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9188628Z E ^
2025-05-07T20:32:58.9189028Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9189463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9189581Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.9190037Z self=,
2025-05-07T20:32:58.9190157Z T=16384,
2025-05-07T20:32:58.9190269Z D=5120,
2025-05-07T20:32:58.9190398Z scale_ub=None,
2025-05-07T20:32:58.9190492Z contiguous=True,
2025-05-07T20:32:58.9190579Z compiled=True,
2025-05-07T20:32:58.9190668Z )
2025-05-07T20:32:58.9197010Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:58.9197116Z moe/activation_test.py:126:
2025-05-07T20:32:58.9206091Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9206200Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9206280Z E ^
2025-05-07T20:32:58.9206642Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9207068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9207190Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.9207412Z self=,
2025-05-07T20:32:58.9207490Z T=1,
2025-05-07T20:32:58.9207576Z D=5120,
2025-05-07T20:32:58.9207662Z scale_ub=1200.0,
2025-05-07T20:32:58.9207750Z contiguous=True,
2025-05-07T20:32:58.9207843Z compiled=True,
2025-05-07T20:32:58.9207922Z )
2025-05-07T20:32:58.9212933Z > y_fp8, y_scale = fn()
2025-05-07T20:32:58.9213045Z moe/activation_test.py:117:
2025-05-07T20:32:58.9219623Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9219723Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:58.9219909Z E ^
2025-05-07T20:32:58.9220264Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9220694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9220811Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.9221034Z self=,
2025-05-07T20:32:58.9221119Z T=1,
2025-05-07T20:32:58.9221196Z D=5120,
2025-05-07T20:32:58.9221280Z scale_ub=None,
2025-05-07T20:32:58.9221379Z contiguous=False,
2025-05-07T20:32:58.9221466Z compiled=True,
2025-05-07T20:32:58.9221544Z )
2025-05-07T20:32:58.9227638Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:58.9227747Z moe/activation_test.py:126:
2025-05-07T20:32:58.9236830Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9236937Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9237017Z E ^
2025-05-07T20:32:58.9237380Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9237802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
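The examples above all exercise the same contract, visible in the test body earlier in this log: silu_mul_quant computes y = x0 * sigmoid(x0) * x1 in higher precision, quantizes each row to FP8, and returns the FP8 tensor together with one scale per row, so that y_fp8.to(torch.float32) * y_scale[:, None] approximately reconstructs y. A pure-PyTorch sketch of that row-wise convention follows; it is a reference illustration, not FBGEMM's triton_quantize_fp8_row, and the scale_ub handling is an assumption inferred from the argument name:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max magnitude maps to FP8_MAX.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Assumed semantics: cap the per-row max at scale_ub before scaling.
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] then recovers y up to FP8 rounding, which is exactly the comparison the failing test would perform once the kernels compile.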
at 0x7f3254f34a60>} 2025-05-07T20:32:58.9235865Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9236057Z context = 2025-05-07T20:32:58.9236062Z 2025-05-07T20:32:58.9236240Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9236546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9236658Z module_map=module_map) 2025-05-07T20:32:58.9236830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9236937Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.9237017Z E ^ 2025-05-07T20:32:58.9237380Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9237387Z 2025-05-07T20:32:58.9237802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9237807Z 2025-05-07T20:32:58.9237925Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9238147Z self=, 2025-05-07T20:32:58.9238227Z T=1, 2025-05-07T20:32:58.9238319Z D=5120, 2025-05-07T20:32:58.9238407Z scale_ub=None, 2025-05-07T20:32:58.9238496Z contiguous=True, 2025-05-07T20:32:58.9238591Z compiled=False, 2025-05-07T20:32:58.9238666Z ) 2025-05-07T20:32:58.9238883Z self = 2025-05-07T20:32:58.9239056Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.9239061Z 2025-05-07T20:32:58.9239141Z @given( 2025-05-07T20:32:58.9239267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9239420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9239581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9239712Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9239828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9239909Z ) 2025-05-07T20:32:58.9240162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9240263Z def test_silu_mul_quant( 2025-05-07T20:32:58.9240354Z self, 2025-05-07T20:32:58.9240432Z T: int, 2025-05-07T20:32:58.9240512Z D: int, 2025-05-07T20:32:58.9240621Z scale_ub: Optional[float], 2025-05-07T20:32:58.9240713Z contiguous: bool, 2025-05-07T20:32:58.9240800Z compiled: bool, 2025-05-07T20:32:58.9240886Z ) -> None: 2025-05-07T20:32:58.9240984Z torch.manual_seed(2025) 2025-05-07T20:32:58.9241058Z 2025-05-07T20:32:58.9241232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9241310Z 2025-05-07T20:32:58.9241406Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9241542Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9241633Z x = x_sign * x_clamp 2025-05-07T20:32:58.9241716Z x0 = x[:, :D] 2025-05-07T20:32:58.9241804Z x1 = x[:, D:] 2025-05-07T20:32:58.9241878Z 2025-05-07T20:32:58.9242015Z if contiguous: 2025-05-07T20:32:58.9242111Z x0 = x0.contiguous() 2025-05-07T20:32:58.9242205Z x1 = x1.contiguous() 2025-05-07T20:32:58.9242291Z 2025-05-07T20:32:58.9242385Z if scale_ub is not None: 2025-05-07T20:32:58.9242492Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9242639Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9242717Z ) 2025-05-07T20:32:58.9242797Z else: 2025-05-07T20:32:58.9242901Z scale_ub_tensor = None 2025-05-07T20:32:58.9242980Z 2025-05-07T20:32:58.9243112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9243213Z op = silu_mul_quant 2025-05-07T20:32:58.9243301Z if compiled: 2025-05-07T20:32:58.9243411Z 
op = torch.compile(op) 2025-05-07T20:32:58.9243518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9243592Z 2025-05-07T20:32:58.9243694Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9243701Z 2025-05-07T20:32:58.9243801Z moe/activation_test.py:117: 2025-05-07T20:32:58.9243976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9244088Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9244189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9244687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9244794Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9245157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9245389Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9245734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9245831Z kernel = self.compile( 2025-05-07T20:32:58.9246221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9246401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9246535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9246539Z 2025-05-07T20:32:58.9246748Z self = 2025-05-07T20:32:58.9247562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9248101Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3255071e10>} 2025-05-07T20:32:58.9248844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9249046Z context = 2025-05-07T20:32:58.9249051Z 2025-05-07T20:32:58.9249218Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9249481Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9249600Z module_map=module_map) 2025-05-07T20:32:58.9249767Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9249877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9249957Z E ^ 2025-05-07T20:32:58.9250311Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9250315Z 2025-05-07T20:32:58.9250789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9250797Z 2025-05-07T20:32:58.9250904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9251133Z self=, 2025-05-07T20:32:58.9251215Z T=128, 2025-05-07T20:32:58.9251293Z D=5120, 2025-05-07T20:32:58.9251388Z scale_ub=None, 2025-05-07T20:32:58.9251478Z contiguous=False, 2025-05-07T20:32:58.9251564Z compiled=True, 2025-05-07T20:32:58.9251653Z ) 2025-05-07T20:32:58.9251870Z self = 2025-05-07T20:32:58.9252048Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.9252053Z 2025-05-07T20:32:58.9252143Z @given( 2025-05-07T20:32:58.9252265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9252372Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9252493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9252659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9252783Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9252858Z ) 2025-05-07T20:32:58.9253111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9253215Z def test_silu_mul_quant( 2025-05-07T20:32:58.9253293Z self, 2025-05-07T20:32:58.9253373Z T: int, 2025-05-07T20:32:58.9253465Z D: int, 2025-05-07T20:32:58.9253571Z scale_ub: Optional[float], 2025-05-07T20:32:58.9253663Z contiguous: bool, 2025-05-07T20:32:58.9253762Z compiled: bool, 2025-05-07T20:32:58.9253844Z ) -> None: 2025-05-07T20:32:58.9253950Z torch.manual_seed(2025) 2025-05-07T20:32:58.9254026Z 2025-05-07T20:32:58.9254197Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9254282Z 2025-05-07T20:32:58.9254379Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9254511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9254612Z x = x_sign * x_clamp 2025-05-07T20:32:58.9254694Z x0 = x[:, :D] 2025-05-07T20:32:58.9254778Z x1 = x[:, D:] 2025-05-07T20:32:58.9254860Z 2025-05-07T20:32:58.9254945Z if contiguous: 2025-05-07T20:32:58.9255038Z x0 = x0.contiguous() 2025-05-07T20:32:58.9255134Z x1 = x1.contiguous() 2025-05-07T20:32:58.9255207Z 2025-05-07T20:32:58.9255299Z if scale_ub is not None: 2025-05-07T20:32:58.9255460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9255638Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9255725Z ) 2025-05-07T20:32:58.9255805Z else: 2025-05-07T20:32:58.9255901Z scale_ub_tensor = None 2025-05-07T20:32:58.9255983Z 2025-05-07T20:32:58.9256113Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9256206Z op = silu_mul_quant 2025-05-07T20:32:58.9256302Z if compiled: 2025-05-07T20:32:58.9256404Z op = torch.compile(op) 2025-05-07T20:32:58.9256513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9256594Z 2025-05-07T20:32:58.9256688Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9256692Z 2025-05-07T20:32:58.9256797Z moe/activation_test.py:117: 2025-05-07T20:32:58.9256925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9257032Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9257140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9257511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9257606Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9258199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9258301Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9258672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9258895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9259240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9259339Z kernel = self.compile( 2025-05-07T20:32:58.9259728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9260010Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9260145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9260150Z 2025-05-07T20:32:58.9260359Z self = 2025-05-07T20:32:58.9261138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9261708Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3255070310>} 2025-05-07T20:32:58.9262454Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9262648Z context = 2025-05-07T20:32:58.9262653Z 2025-05-07T20:32:58.9262821Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9263095Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9263204Z module_map=module_map) 2025-05-07T20:32:58.9263378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9263478Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9263559Z E ^ 2025-05-07T20:32:58.9263919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9263924Z 2025-05-07T20:32:58.9264342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9264390Z 2025-05-07T20:32:58.9264537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9264768Z self=, 2025-05-07T20:32:58.9264847Z T=128, 2025-05-07T20:32:58.9264933Z D=7168, 2025-05-07T20:32:58.9265018Z scale_ub=1200.0, 2025-05-07T20:32:58.9265108Z contiguous=False, 2025-05-07T20:32:58.9265203Z compiled=False, 2025-05-07T20:32:58.9265282Z ) 2025-05-07T20:32:58.9265500Z self = 2025-05-07T20:32:58.9265682Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.9265686Z 2025-05-07T20:32:58.9265765Z @given( 2025-05-07T20:32:58.9265886Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9265992Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9266110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9266240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9266357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9266436Z ) 2025-05-07T20:32:58.9266694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9266790Z def test_silu_mul_quant( 2025-05-07T20:32:58.9266869Z self, 2025-05-07T20:32:58.9267000Z T: int, 2025-05-07T20:32:58.9267080Z D: int, 2025-05-07T20:32:58.9267186Z scale_ub: Optional[float], 2025-05-07T20:32:58.9267283Z contiguous: bool, 2025-05-07T20:32:58.9267371Z compiled: bool, 2025-05-07T20:32:58.9267453Z ) -> None: 2025-05-07T20:32:58.9267554Z torch.manual_seed(2025) 2025-05-07T20:32:58.9267629Z 2025-05-07T20:32:58.9267805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9267881Z 2025-05-07T20:32:58.9267976Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9268114Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9268208Z x = x_sign * x_clamp 2025-05-07T20:32:58.9268291Z x0 = x[:, :D] 2025-05-07T20:32:58.9268379Z x1 = x[:, D:] 2025-05-07T20:32:58.9268452Z 2025-05-07T20:32:58.9268541Z if contiguous: 2025-05-07T20:32:58.9268642Z x0 = x0.contiguous() 2025-05-07T20:32:58.9268732Z x1 = x1.contiguous() 2025-05-07T20:32:58.9268809Z 2025-05-07T20:32:58.9268960Z if scale_ub is not None: 2025-05-07T20:32:58.9269067Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9269211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9269290Z ) 2025-05-07T20:32:58.9269368Z else: 2025-05-07T20:32:58.9269469Z scale_ub_tensor = None 2025-05-07T20:32:58.9269544Z 2025-05-07T20:32:58.9269675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9269779Z op = silu_mul_quant 2025-05-07T20:32:58.9269866Z if compiled: 2025-05-07T20:32:58.9269970Z op = torch.compile(op) 2025-05-07T20:32:58.9270086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9270160Z 2025-05-07T20:32:58.9270253Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9270257Z 2025-05-07T20:32:58.9270364Z moe/activation_test.py:117: 2025-05-07T20:32:58.9270495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9270608Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9270710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9271206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9271311Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9271670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9271944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9272334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9272433Z kernel = self.compile( 2025-05-07T20:32:58.9272821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9273002Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9273134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9273138Z 2025-05-07T20:32:58.9273348Z self = 2025-05-07T20:32:58.9274131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9274640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254f35900>} 2025-05-07T20:32:58.9275433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9275625Z context = 2025-05-07T20:32:58.9275638Z 2025-05-07T20:32:58.9275809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9276077Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9276192Z module_map=module_map) 2025-05-07T20:32:58.9276354Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9276459Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9276549Z E ^ 2025-05-07T20:32:58.9276906Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9276910Z 2025-05-07T20:32:58.9277336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9277340Z 2025-05-07T20:32:58.9277449Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9277714Z self=, 2025-05-07T20:32:58.9277797Z T=128, 2025-05-07T20:32:58.9277877Z D=5120, 2025-05-07T20:32:58.9277961Z scale_ub=None, 2025-05-07T20:32:58.9278056Z contiguous=False, 2025-05-07T20:32:58.9278141Z compiled=False, 2025-05-07T20:32:58.9278218Z ) 2025-05-07T20:32:58.9278440Z self = 2025-05-07T20:32:58.9278616Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.9278621Z 2025-05-07T20:32:58.9278705Z @given( 2025-05-07T20:32:58.9278829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9278932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9279055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9279172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9279294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9279379Z ) 2025-05-07T20:32:58.9279630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9279736Z def test_silu_mul_quant( 2025-05-07T20:32:58.9279815Z self, 2025-05-07T20:32:58.9279893Z T: int, 2025-05-07T20:32:58.9279977Z D: int, 2025-05-07T20:32:58.9280077Z scale_ub: Optional[float], 2025-05-07T20:32:58.9280169Z contiguous: bool, 2025-05-07T20:32:58.9280315Z compiled: bool, 2025-05-07T20:32:58.9280395Z ) -> None: 2025-05-07T20:32:58.9280492Z torch.manual_seed(2025) 2025-05-07T20:32:58.9280616Z 2025-05-07T20:32:58.9280787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9280865Z 2025-05-07T20:32:58.9280967Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9281095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9281190Z x = x_sign * x_clamp 2025-05-07T20:32:58.9281282Z x0 = x[:, :D] 2025-05-07T20:32:58.9281367Z x1 = x[:, D:] 2025-05-07T20:32:58.9281453Z 2025-05-07T20:32:58.9281539Z if contiguous: 2025-05-07T20:32:58.9281634Z x0 = x0.contiguous() 2025-05-07T20:32:58.9281731Z x1 = x1.contiguous() 2025-05-07T20:32:58.9281804Z 2025-05-07T20:32:58.9281895Z if scale_ub is not None: 2025-05-07T20:32:58.9282008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9282148Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9282226Z ) 2025-05-07T20:32:58.9282310Z else: 2025-05-07T20:32:58.9282409Z scale_ub_tensor = None 2025-05-07T20:32:58.9282487Z 2025-05-07T20:32:58.9282622Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9282714Z op = silu_mul_quant 2025-05-07T20:32:58.9282809Z if compiled: 2025-05-07T20:32:58.9282955Z op = torch.compile(op) 2025-05-07T20:32:58.9283064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9283148Z 2025-05-07T20:32:58.9283244Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9283248Z 2025-05-07T20:32:58.9283348Z moe/activation_test.py:117: 2025-05-07T20:32:58.9283483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9283587Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9283686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9284204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9284305Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9284676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9284900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9285243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9285386Z kernel = self.compile( 2025-05-07T20:32:58.9285774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9285961Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9286088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9286096Z 2025-05-07T20:32:58.9286304Z self = 2025-05-07T20:32:58.9287079Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9287578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3254cf2a70>} 2025-05-07T20:32:58.9288359Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9288573Z context = 2025-05-07T20:32:58.9288578Z 2025-05-07T20:32:58.9288743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9289093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9289204Z module_map=module_map) 2025-05-07T20:32:58.9289376Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9289477Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9289555Z E ^ 2025-05-07T20:32:58.9290175Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:58.9290667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9290787Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) [same test body, traceback, and CompilationError as above: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:58.9309753Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) [same CompilationError; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 before reaching _fbgemm_silu_mul_quant[grid]]
2025-05-07T20:32:58.9323430Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) [same test body and CompilationError]
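Every failure above has the same root cause: both _fbgemm_silu_mul_quant and the quantization kernels request Triton's fp8e4nv dtype (torch.float8_e4m3fn), and this runner's GPU rejects it at compile time. Below is a minimal sketch of a capability guard a test could use to skip these cases on such hardware; the helper name and the (8, 9) threshold are assumptions, not something the log states (fp8e4nv is generally available from compute capability 8.9, i.e. Ada/Hopper, while the A10G behind linux.g5.4xlarge reports 8.6):

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumed threshold: Triton lowers fp8e4nv (float8_e4m3fn) only on
        # compute capability >= 8.9. The A10G (8, 6) driving this job would
        # return False here, matching the ValueError above.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...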
2025-05-07T20:32:58.9336973Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.9338103Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test body as above; this time fn() returns, and the failure surfaces in the reference path instead:]
2025-05-07T20:32:58.9342837Z         y_fp8, y_scale = fn()
2025-05-07T20:32:58.9342968Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:58.9343195Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:58.9343306Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:58.9343411Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:58.9343543Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:58.9343687Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:58.9343874Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:58.9344026Z moe/activation_test.py:126:
2025-05-07T20:32:58.9344195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:58.9344309Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:58.9344443Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:58.9345009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:58.9345115Z     _kernel_quantize_fp8_row[grid](
[Triton jit/autotuner frames as in the traceback above (jit.py:330, autotuner.py:186/166, testing.py:117, autotuner.py:152, jit.py:623, compiler.py:273), ending in:]
2025-05-07T20:32:58.9352950Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9353053Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.9353138Z E       ^
2025-05-07T20:32:58.9353488Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9354003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9354113Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) [same test body; via torch/_dynamo/eval_frame.py:678, same CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
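The T=1, D=7168, scale_ub=None example above is the instructive one: fn() itself returns, and the failure moves into ref_fn, because triton_quantize_fp8_row launches its own Triton kernel (_kernel_quantize_fp8_row) and trips the same fp8e4nv restriction while autotuning. A pure-PyTorch sketch of a row-wise fp8 reference that avoids Triton entirely; the semantics (per-row absmax scale, optional scale_ub clamp, dequantize as y_fp8.float() * scale[:, None]) are inferred from the test's own dequantization step, not taken from FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise absmax scaling into float8_e4m3fn (the dtype Triton calls fp8e4nv).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = (row_max / fp8_max).clamp(min=1e-12)  # one dequant scale per row
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize as y_fp8.to(torch.float32) * scale[:, None]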
2025-05-07T20:32:58.9367488Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) [same test body and CompilationError]
2025-05-07T20:32:58.9380569Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) [same test body and CompilationError]
2025-05-07T20:32:58.9395077Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) [same test body and CompilationError]
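The remaining examples below vary only the sampled parameters and fail identically before the kernel runs. For what the contiguous flag changes about the inputs: x0 = x[:, :D] and x1 = x[:, D:] are strided views into one [T, 2*D] buffer, so with contiguous=False the kernel must read rows with a stride of 2*D elements. A small self-contained illustration (plain PyTorch, no GPU needed; shapes chosen to match one of the sampled examples):

    import torch

    T, D = 128, 5120
    x = torch.randn(T, 2 * D, dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]  # views sharing x's storage; row stride is 2*D
    assert not x0.is_contiguous() and not x1.is_contiguous()
    # The test's reference math in fp32: SiLU(x0) * x1.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    assert y.shape == (T, D)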
2025-05-07T20:32:58.9408615Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) [same test body and CompilationError]
2025-05-07T20:32:58.9421647Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) [same test body and traceback, ending in:]
2025-05-07T20:32:58.9444884Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.9444997Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:58.9445152Z E       ^
2025-05-07T20:32:58.9445585Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.9446100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.9446111Z
Hypothesis then tries eleven more examples. Every one fails with the identical traceback (moe/activation_test.py:117 -> moe/activation_test.py:115 in fn -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant -> triton/runtime/jit.py:623 in run -> triton/compiler/compiler.py:273 in compile, with torch/_dynamo/eval_frame.py:678 additionally on the stack when compiled=True) and the identical error, triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the sampled parameters differ; a capability-gate sketch follows the list:
2025-05-07T20:32:58.9446230Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:58.9481644Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.9495698Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:58.9508807Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:58.9522160Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:58.9534899Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:58.9547908Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:58.9561370Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:58.9574202Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.9596596Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.9610075Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9622603Z 2025-05-07T20:32:58.9623021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9623026Z 2025-05-07T20:32:58.9623136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9623356Z self=, 2025-05-07T20:32:58.9623444Z T=16384, 2025-05-07T20:32:58.9623522Z D=5120, 2025-05-07T20:32:58.9623606Z scale_ub=1200.0, 2025-05-07T20:32:58.9623698Z contiguous=True, 2025-05-07T20:32:58.9623783Z compiled=True, 2025-05-07T20:32:58.9623856Z ) 2025-05-07T20:32:58.9624075Z self = 2025-05-07T20:32:58.9624248Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.9624253Z 2025-05-07T20:32:58.9624331Z @given( 2025-05-07T20:32:58.9624496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9624595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9624719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9624837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9624951Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9625034Z ) 2025-05-07T20:32:58.9625279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9625375Z def test_silu_mul_quant( 2025-05-07T20:32:58.9625459Z self, 2025-05-07T20:32:58.9625541Z T: int, 2025-05-07T20:32:58.9625615Z D: int, 2025-05-07T20:32:58.9625722Z scale_ub: Optional[float], 2025-05-07T20:32:58.9625812Z contiguous: bool, 2025-05-07T20:32:58.9625897Z compiled: bool, 2025-05-07T20:32:58.9625982Z ) -> None: 2025-05-07T20:32:58.9626077Z torch.manual_seed(2025) 2025-05-07T20:32:58.9626150Z 2025-05-07T20:32:58.9626333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9626406Z 2025-05-07T20:32:58.9626509Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9626632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9626720Z x = x_sign * x_clamp 2025-05-07T20:32:58.9626806Z x0 = x[:, :D] 2025-05-07T20:32:58.9626884Z x1 = x[:, D:] 2025-05-07T20:32:58.9626955Z 2025-05-07T20:32:58.9627093Z if contiguous: 2025-05-07T20:32:58.9627182Z x0 = x0.contiguous() 2025-05-07T20:32:58.9627310Z x1 = x1.contiguous() 2025-05-07T20:32:58.9627388Z 2025-05-07T20:32:58.9627480Z if scale_ub is not None: 2025-05-07T20:32:58.9627588Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9627729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9627809Z ) 2025-05-07T20:32:58.9627892Z else: 2025-05-07T20:32:58.9627989Z scale_ub_tensor = None 2025-05-07T20:32:58.9628063Z 2025-05-07T20:32:58.9628198Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9628289Z op = silu_mul_quant 2025-05-07T20:32:58.9628375Z if compiled: 2025-05-07T20:32:58.9628482Z op = torch.compile(op) 2025-05-07T20:32:58.9628588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9628661Z 2025-05-07T20:32:58.9628757Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9628765Z 2025-05-07T20:32:58.9628861Z moe/activation_test.py:117: 2025-05-07T20:32:58.9628997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9629098Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9629197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9629606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9629701Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9630205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9630307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9630661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9630890Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9631237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9631334Z kernel = self.compile( 2025-05-07T20:32:58.9631722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9631899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9632026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9632082Z 2025-05-07T20:32:58.9632288Z self = 2025-05-07T20:32:58.9633055Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9633565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31334d6830>} 2025-05-07T20:32:58.9634307Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9634501Z context = 2025-05-07T20:32:58.9634508Z 2025-05-07T20:32:58.9634676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9634944Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9635057Z module_map=module_map) 2025-05-07T20:32:58.9635223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9635318Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9635400Z E ^ 2025-05-07T20:32:58.9635795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9635838Z 2025-05-07T20:32:58.9636256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9636261Z 2025-05-07T20:32:58.9636365Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9636586Z self=, 2025-05-07T20:32:58.9636673Z T=16384, 2025-05-07T20:32:58.9636749Z D=5120, 2025-05-07T20:32:58.9636836Z scale_ub=None, 2025-05-07T20:32:58.9636922Z contiguous=False, 2025-05-07T20:32:58.9637001Z compiled=True, 2025-05-07T20:32:58.9637078Z ) 2025-05-07T20:32:58.9637294Z self = 2025-05-07T20:32:58.9637471Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.9637476Z 2025-05-07T20:32:58.9637563Z @given( 2025-05-07T20:32:58.9637680Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9637784Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9637907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9638025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9638148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9638221Z ) 2025-05-07T20:32:58.9638507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9638615Z def test_silu_mul_quant( 2025-05-07T20:32:58.9638689Z self, 2025-05-07T20:32:58.9638763Z T: int, 2025-05-07T20:32:58.9638850Z D: int, 2025-05-07T20:32:58.9638948Z scale_ub: Optional[float], 2025-05-07T20:32:58.9639036Z contiguous: bool, 2025-05-07T20:32:58.9639126Z compiled: bool, 2025-05-07T20:32:58.9639205Z ) -> None: 2025-05-07T20:32:58.9639303Z torch.manual_seed(2025) 2025-05-07T20:32:58.9639384Z 2025-05-07T20:32:58.9639554Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9639630Z 2025-05-07T20:32:58.9639735Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9639859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9639955Z x = x_sign * x_clamp 2025-05-07T20:32:58.9640036Z x0 = x[:, :D] 2025-05-07T20:32:58.9640119Z x1 = x[:, D:] 2025-05-07T20:32:58.9640199Z 2025-05-07T20:32:58.9640328Z if contiguous: 2025-05-07T20:32:58.9640420Z x0 = x0.contiguous() 2025-05-07T20:32:58.9640517Z x1 = x1.contiguous() 2025-05-07T20:32:58.9640591Z 2025-05-07T20:32:58.9640681Z if scale_ub is not None: 2025-05-07T20:32:58.9640795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9640930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9641008Z ) 2025-05-07T20:32:58.9641097Z else: 2025-05-07T20:32:58.9641194Z scale_ub_tensor = None 2025-05-07T20:32:58.9641273Z 2025-05-07T20:32:58.9641404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9641492Z op = silu_mul_quant 2025-05-07T20:32:58.9641580Z if compiled: 2025-05-07T20:32:58.9641679Z op = torch.compile(op) 2025-05-07T20:32:58.9641783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9641863Z 2025-05-07T20:32:58.9641952Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9641960Z 2025-05-07T20:32:58.9642061Z moe/activation_test.py:117: 2025-05-07T20:32:58.9642194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9642294Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9642399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9642760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9642898Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9643434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9643533Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9643887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9644120Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9644467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9644568Z kernel = self.compile( 2025-05-07T20:32:58.9644946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9645120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9645256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9645261Z 2025-05-07T20:32:58.9645470Z self = 2025-05-07T20:32:58.9646302Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9646802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31334d7760>} 2025-05-07T20:32:58.9647541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9647735Z context = 2025-05-07T20:32:58.9647743Z 2025-05-07T20:32:58.9647907Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9648179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9648285Z module_map=module_map) 2025-05-07T20:32:58.9648445Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9648550Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9648628Z E ^ 2025-05-07T20:32:58.9649023Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9649035Z 2025-05-07T20:32:58.9649450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9649454Z 2025-05-07T20:32:58.9649559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9649784Z self=, 2025-05-07T20:32:58.9649861Z T=2048, 2025-05-07T20:32:58.9649933Z D=5120, 2025-05-07T20:32:58.9650020Z scale_ub=None, 2025-05-07T20:32:58.9650109Z contiguous=False, 2025-05-07T20:32:58.9650190Z compiled=True, 2025-05-07T20:32:58.9650268Z ) 2025-05-07T20:32:58.9650481Z self = 2025-05-07T20:32:58.9650661Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.9650666Z 2025-05-07T20:32:58.9650743Z @given( 2025-05-07T20:32:58.9650860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9650982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9651097Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9651212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9651331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9651406Z ) 2025-05-07T20:32:58.9651697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9651796Z def test_silu_mul_quant( 2025-05-07T20:32:58.9651910Z self, 2025-05-07T20:32:58.9651995Z T: int, 2025-05-07T20:32:58.9652070Z D: int, 2025-05-07T20:32:58.9652169Z scale_ub: Optional[float], 2025-05-07T20:32:58.9652266Z contiguous: bool, 2025-05-07T20:32:58.9652351Z compiled: bool, 2025-05-07T20:32:58.9652429Z ) -> None: 2025-05-07T20:32:58.9652531Z torch.manual_seed(2025) 2025-05-07T20:32:58.9652603Z 2025-05-07T20:32:58.9652771Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9652849Z 2025-05-07T20:32:58.9652939Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9653061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9653156Z x = x_sign * x_clamp 2025-05-07T20:32:58.9653236Z x0 = x[:, :D] 2025-05-07T20:32:58.9653327Z x1 = x[:, D:] 2025-05-07T20:32:58.9653401Z 2025-05-07T20:32:58.9653486Z if contiguous: 2025-05-07T20:32:58.9653587Z x0 = x0.contiguous() 2025-05-07T20:32:58.9653677Z x1 = x1.contiguous() 2025-05-07T20:32:58.9653750Z 2025-05-07T20:32:58.9653848Z if scale_ub is not None: 2025-05-07T20:32:58.9653954Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9654144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9654230Z ) 2025-05-07T20:32:58.9654309Z else: 2025-05-07T20:32:58.9654402Z scale_ub_tensor = None 2025-05-07T20:32:58.9654481Z 2025-05-07T20:32:58.9654613Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9654702Z op = silu_mul_quant 2025-05-07T20:32:58.9654794Z if compiled: 2025-05-07T20:32:58.9654892Z op = torch.compile(op) 2025-05-07T20:32:58.9655004Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9655081Z 2025-05-07T20:32:58.9655171Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9655175Z 2025-05-07T20:32:58.9655281Z moe/activation_test.py:117: 2025-05-07T20:32:58.9655408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9655508Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9655610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9655977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9656145Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9656632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9656731Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9657090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9657309Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9657655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9657755Z kernel = self.compile( 2025-05-07T20:32:58.9658136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9658316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9658444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9658448Z 2025-05-07T20:32:58.9658651Z self = 2025-05-07T20:32:58.9659422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9660112Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31333783a0>} 2025-05-07T20:32:58.9660875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9661066Z context = 2025-05-07T20:32:58.9661073Z 2025-05-07T20:32:58.9661242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9661508Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9661615Z module_map=module_map) 2025-05-07T20:32:58.9661780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9661877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9661958Z E ^ 2025-05-07T20:32:58.9662323Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9662328Z 2025-05-07T20:32:58.9662742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9662747Z 2025-05-07T20:32:58.9662861Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9663125Z self=, 2025-05-07T20:32:58.9663207Z T=2048, 2025-05-07T20:32:58.9663290Z D=5120, 2025-05-07T20:32:58.9663374Z scale_ub=1200.0, 2025-05-07T20:32:58.9663461Z contiguous=False, 2025-05-07T20:32:58.9663551Z compiled=True, 2025-05-07T20:32:58.9663624Z ) 2025-05-07T20:32:58.9663839Z self = 2025-05-07T20:32:58.9664021Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:58.9664029Z 2025-05-07T20:32:58.9664107Z @given( 2025-05-07T20:32:58.9664236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9664337Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9664455Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9664581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9664701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9664776Z ) 2025-05-07T20:32:58.9665071Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9665166Z def test_silu_mul_quant( 2025-05-07T20:32:58.9665244Z self, 2025-05-07T20:32:58.9665326Z T: int, 2025-05-07T20:32:58.9665402Z D: int, 2025-05-07T20:32:58.9665509Z scale_ub: Optional[float], 2025-05-07T20:32:58.9665600Z contiguous: bool, 2025-05-07T20:32:58.9665686Z compiled: bool, 2025-05-07T20:32:58.9665774Z ) -> None: 2025-05-07T20:32:58.9665872Z torch.manual_seed(2025) 2025-05-07T20:32:58.9665949Z 2025-05-07T20:32:58.9666126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9666203Z 2025-05-07T20:32:58.9666296Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9666427Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9666518Z x = x_sign * x_clamp 2025-05-07T20:32:58.9666603Z x0 = x[:, :D] 2025-05-07T20:32:58.9666695Z x1 = x[:, D:] 2025-05-07T20:32:58.9666768Z 2025-05-07T20:32:58.9666859Z if contiguous: 2025-05-07T20:32:58.9666952Z x0 = x0.contiguous() 2025-05-07T20:32:58.9667042Z x1 = x1.contiguous() 2025-05-07T20:32:58.9667122Z 2025-05-07T20:32:58.9667213Z if scale_ub is not None: 2025-05-07T20:32:58.9667319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9667461Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9667584Z ) 2025-05-07T20:32:58.9667663Z else: 2025-05-07T20:32:58.9667805Z scale_ub_tensor = None 2025-05-07T20:32:58.9667879Z 2025-05-07T20:32:58.9668011Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9668109Z op = silu_mul_quant 2025-05-07T20:32:58.9668196Z if compiled: 2025-05-07T20:32:58.9668298Z op = torch.compile(op) 2025-05-07T20:32:58.9668415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9668491Z 2025-05-07T20:32:58.9668589Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9668593Z 2025-05-07T20:32:58.9668692Z moe/activation_test.py:117: 2025-05-07T20:32:58.9668820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9668930Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9669028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9669393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9669498Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9669999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9670105Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9670503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9670735Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9671086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9671181Z kernel = self.compile( 2025-05-07T20:32:58.9671560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9671742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9671872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9671879Z 2025-05-07T20:32:58.9672092Z self = 2025-05-07T20:32:58.9672867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9673420Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3133378820>} 2025-05-07T20:32:58.9674158Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9674354Z context = 2025-05-07T20:32:58.9674358Z 2025-05-07T20:32:58.9674531Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9674796Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9674909Z module_map=module_map) 2025-05-07T20:32:58.9675075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9675175Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9675262Z E ^ 2025-05-07T20:32:58.9675614Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9675619Z 2025-05-07T20:32:58.9676035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9676046Z 2025-05-07T20:32:58.9676151Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9676417Z self=, 2025-05-07T20:32:58.9676500Z T=4096, 2025-05-07T20:32:58.9676639Z D=5120, 2025-05-07T20:32:58.9676725Z scale_ub=1200.0, 2025-05-07T20:32:58.9676818Z contiguous=True, 2025-05-07T20:32:58.9676901Z compiled=True, 2025-05-07T20:32:58.9676974Z ) 2025-05-07T20:32:58.9677196Z self = 2025-05-07T20:32:58.9677371Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.9677379Z 2025-05-07T20:32:58.9677463Z @given( 2025-05-07T20:32:58.9677583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9677683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9677802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9677918Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9678032Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9678117Z ) 2025-05-07T20:32:58.9678370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9678464Z def test_silu_mul_quant( 2025-05-07T20:32:58.9678547Z self, 2025-05-07T20:32:58.9678624Z T: int, 2025-05-07T20:32:58.9678701Z D: int, 2025-05-07T20:32:58.9678807Z scale_ub: Optional[float], 2025-05-07T20:32:58.9678897Z contiguous: bool, 2025-05-07T20:32:58.9679032Z compiled: bool, 2025-05-07T20:32:58.9679116Z ) -> None: 2025-05-07T20:32:58.9679211Z torch.manual_seed(2025) 2025-05-07T20:32:58.9679291Z 2025-05-07T20:32:58.9679462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9679540Z 2025-05-07T20:32:58.9679640Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9679765Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9679854Z x = x_sign * x_clamp 2025-05-07T20:32:58.9679943Z x0 = x[:, :D] 2025-05-07T20:32:58.9680025Z x1 = x[:, D:] 2025-05-07T20:32:58.9680100Z 2025-05-07T20:32:58.9680194Z if contiguous: 2025-05-07T20:32:58.9680287Z x0 = x0.contiguous() 2025-05-07T20:32:58.9680383Z x1 = x1.contiguous() 2025-05-07T20:32:58.9680454Z 2025-05-07T20:32:58.9680550Z if scale_ub is not None: 2025-05-07T20:32:58.9680664Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9680801Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9680925Z ) 2025-05-07T20:32:58.9681011Z else: 2025-05-07T20:32:58.9681106Z scale_ub_tensor = None 2025-05-07T20:32:58.9681181Z 2025-05-07T20:32:58.9681318Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9681410Z op = silu_mul_quant 2025-05-07T20:32:58.9681497Z if compiled: 2025-05-07T20:32:58.9681605Z op = torch.compile(op) 2025-05-07T20:32:58.9681717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9681790Z 2025-05-07T20:32:58.9681888Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9681897Z 2025-05-07T20:32:58.9681998Z moe/activation_test.py:117: 2025-05-07T20:32:58.9682132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9682233Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9682331Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9682707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9682807Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9683297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9683403Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9683759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9684037Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9684417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9684513Z kernel = self.compile( 2025-05-07T20:32:58.9684904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9685081Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9685216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9685221Z 2025-05-07T20:32:58.9685424Z self = 2025-05-07T20:32:58.9686192Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9686698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3133379360>} 2025-05-07T20:32:58.9687477Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9687676Z context = 2025-05-07T20:32:58.9687680Z 2025-05-07T20:32:58.9687848Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9688140Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9688266Z module_map=module_map) 2025-05-07T20:32:58.9688445Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9688555Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9688632Z E ^ 2025-05-07T20:32:58.9688988Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9688993Z 2025-05-07T20:32:58.9689418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9689426Z 2025-05-07T20:32:58.9689530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9690572Z self=, 2025-05-07T20:32:58.9690655Z T=128, 2025-05-07T20:32:58.9690733Z D=5120, 2025-05-07T20:32:58.9690824Z scale_ub=1200.0, 2025-05-07T20:32:58.9690912Z contiguous=False, 2025-05-07T20:32:58.9690996Z compiled=True, 2025-05-07T20:32:58.9691079Z ) 2025-05-07T20:32:58.9691293Z self = 2025-05-07T20:32:58.9691470Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:58.9691475Z 2025-05-07T20:32:58.9691562Z @given( 2025-05-07T20:32:58.9691685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9691790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9691906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9692029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9692149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9692234Z ) 2025-05-07T20:32:58.9692483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9692585Z def test_silu_mul_quant( 2025-05-07T20:32:58.9692663Z self, 2025-05-07T20:32:58.9692739Z T: int, 2025-05-07T20:32:58.9692826Z D: int, 2025-05-07T20:32:58.9692927Z scale_ub: Optional[float], 2025-05-07T20:32:58.9693018Z contiguous: bool, 2025-05-07T20:32:58.9693208Z compiled: bool, 2025-05-07T20:32:58.9693289Z ) -> None: 2025-05-07T20:32:58.9693448Z torch.manual_seed(2025) 2025-05-07T20:32:58.9693524Z 2025-05-07T20:32:58.9693694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9693778Z 2025-05-07T20:32:58.9693873Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9693999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9694097Z x = x_sign * x_clamp 2025-05-07T20:32:58.9694185Z x0 = x[:, :D] 2025-05-07T20:32:58.9694266Z x1 = x[:, D:] 2025-05-07T20:32:58.9694348Z 2025-05-07T20:32:58.9694434Z if contiguous: 2025-05-07T20:32:58.9694527Z x0 = x0.contiguous() 2025-05-07T20:32:58.9694622Z x1 = x1.contiguous() 2025-05-07T20:32:58.9694695Z 2025-05-07T20:32:58.9694787Z if scale_ub is not None: 2025-05-07T20:32:58.9694901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9695039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9695123Z ) 2025-05-07T20:32:58.9695205Z else: 2025-05-07T20:32:58.9695301Z scale_ub_tensor = None 2025-05-07T20:32:58.9695381Z 2025-05-07T20:32:58.9695513Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9695607Z op = silu_mul_quant 2025-05-07T20:32:58.9695700Z if compiled: 2025-05-07T20:32:58.9695867Z op = torch.compile(op) 2025-05-07T20:32:58.9695980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9696063Z 2025-05-07T20:32:58.9696156Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9696160Z 2025-05-07T20:32:58.9696269Z moe/activation_test.py:117: 2025-05-07T20:32:58.9696399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9696501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9696609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9696984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9697080Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9697586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9697685Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9698050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9698342Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9698686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9698786Z kernel = self.compile( 2025-05-07T20:32:58.9699173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9699350Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9699487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9699492Z 2025-05-07T20:32:58.9699696Z self = 2025-05-07T20:32:58.9700591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9701099Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313337a290>} 2025-05-07T20:32:58.9701845Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9702083Z context = 2025-05-07T20:32:58.9702125Z 2025-05-07T20:32:58.9702294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9702570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9702677Z module_map=module_map) 2025-05-07T20:32:58.9702841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9702952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9703030Z E ^ 2025-05-07T20:32:58.9703390Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9703395Z 2025-05-07T20:32:58.9703806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9703813Z 2025-05-07T20:32:58.9703917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9704150Z self=, 2025-05-07T20:32:58.9704230Z T=16384, 2025-05-07T20:32:58.9704312Z D=7168, 2025-05-07T20:32:58.9704396Z scale_ub=1200.0, 2025-05-07T20:32:58.9704481Z contiguous=True, 2025-05-07T20:32:58.9704570Z compiled=True, 2025-05-07T20:32:58.9704645Z ) 2025-05-07T20:32:58.9704903Z self = 2025-05-07T20:32:58.9705090Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:58.9705095Z 2025-05-07T20:32:58.9705173Z @given( 2025-05-07T20:32:58.9705291Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9705395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9705514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9705637Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9705756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9705831Z ) 2025-05-07T20:32:58.9706087Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9706183Z def test_silu_mul_quant( 2025-05-07T20:32:58.9706262Z self, 2025-05-07T20:32:58.9706345Z T: int, 2025-05-07T20:32:58.9706422Z D: int, 2025-05-07T20:32:58.9706524Z scale_ub: Optional[float], 2025-05-07T20:32:58.9706620Z contiguous: bool, 2025-05-07T20:32:58.9706752Z compiled: bool, 2025-05-07T20:32:58.9706832Z ) -> None: 2025-05-07T20:32:58.9706932Z torch.manual_seed(2025) 2025-05-07T20:32:58.9707006Z 2025-05-07T20:32:58.9707181Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9707256Z 2025-05-07T20:32:58.9707347Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9707479Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9707572Z x = x_sign * x_clamp 2025-05-07T20:32:58.9707654Z x0 = x[:, :D] 2025-05-07T20:32:58.9707744Z x1 = x[:, D:] 2025-05-07T20:32:58.9707818Z 2025-05-07T20:32:58.9707903Z if contiguous: 2025-05-07T20:32:58.9708003Z x0 = x0.contiguous() 2025-05-07T20:32:58.9708094Z x1 = x1.contiguous() 2025-05-07T20:32:58.9708166Z 2025-05-07T20:32:58.9708263Z if scale_ub is not None: 2025-05-07T20:32:58.9708373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9708511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9708595Z ) 2025-05-07T20:32:58.9708674Z else: 2025-05-07T20:32:58.9708775Z scale_ub_tensor = None 2025-05-07T20:32:58.9708848Z 2025-05-07T20:32:58.9708977Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9709073Z op = silu_mul_quant 2025-05-07T20:32:58.9709160Z if compiled: 2025-05-07T20:32:58.9709332Z op = torch.compile(op) 2025-05-07T20:32:58.9709445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9709561Z 2025-05-07T20:32:58.9709656Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9709660Z 2025-05-07T20:32:58.9709766Z moe/activation_test.py:117: 2025-05-07T20:32:58.9709895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9710005Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9710109Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9714689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9714801Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9715311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9715413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9715787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9716025Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9716376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9716476Z kernel = self.compile( 2025-05-07T20:32:58.9716940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9717132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9717274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9717279Z 2025-05-07T20:32:58.9717490Z self = 2025-05-07T20:32:58.9718315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9718828Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313337ad40>} 2025-05-07T20:32:58.9719577Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9719818Z context = 2025-05-07T20:32:58.9719823Z 2025-05-07T20:32:58.9719995Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9720269Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9720380Z module_map=module_map) 2025-05-07T20:32:58.9720549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9720656Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9720733Z E ^ 2025-05-07T20:32:58.9721087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9721092Z 2025-05-07T20:32:58.9721511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9721518Z 2025-05-07T20:32:58.9721620Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9721844Z self=, 2025-05-07T20:32:58.9721923Z T=16384, 2025-05-07T20:32:58.9721999Z D=5120, 2025-05-07T20:32:58.9722084Z scale_ub=1200.0, 2025-05-07T20:32:58.9722167Z contiguous=True, 2025-05-07T20:32:58.9722246Z compiled=False, 2025-05-07T20:32:58.9722324Z ) 2025-05-07T20:32:58.9722584Z self = 2025-05-07T20:32:58.9722802Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.9722814Z 2025-05-07T20:32:58.9722892Z @given( 2025-05-07T20:32:58.9723011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9723116Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9723233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9723351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9723477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9723549Z ) 2025-05-07T20:32:58.9723796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9723896Z def test_silu_mul_quant( 2025-05-07T20:32:58.9723972Z self, 2025-05-07T20:32:58.9724049Z T: int, 2025-05-07T20:32:58.9724129Z D: int, 2025-05-07T20:32:58.9724224Z scale_ub: Optional[float], 2025-05-07T20:32:58.9724325Z contiguous: bool, 2025-05-07T20:32:58.9724409Z compiled: bool, 2025-05-07T20:32:58.9724488Z ) -> None: 2025-05-07T20:32:58.9724588Z torch.manual_seed(2025) 2025-05-07T20:32:58.9724661Z 2025-05-07T20:32:58.9724828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9724909Z 2025-05-07T20:32:58.9725000Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9725165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9725264Z x = x_sign * x_clamp 2025-05-07T20:32:58.9725344Z x0 = x[:, :D] 2025-05-07T20:32:58.9725420Z x1 = x[:, D:] 2025-05-07T20:32:58.9725499Z 2025-05-07T20:32:58.9725583Z if contiguous: 2025-05-07T20:32:58.9725674Z x0 = x0.contiguous() 2025-05-07T20:32:58.9725768Z x1 = x1.contiguous() 2025-05-07T20:32:58.9725841Z 2025-05-07T20:32:58.9725934Z if scale_ub is not None: 2025-05-07T20:32:58.9726045Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9726184Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9726270Z ) 2025-05-07T20:32:58.9726348Z else: 2025-05-07T20:32:58.9726443Z scale_ub_tensor = None 2025-05-07T20:32:58.9726518Z 2025-05-07T20:32:58.9726647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9726742Z op = silu_mul_quant 2025-05-07T20:32:58.9726828Z if compiled: 2025-05-07T20:32:58.9726970Z op = torch.compile(op) 2025-05-07T20:32:58.9727077Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9727154Z 2025-05-07T20:32:58.9727245Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9727250Z 2025-05-07T20:32:58.9727352Z moe/activation_test.py:117: 2025-05-07T20:32:58.9727478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9727576Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9727679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9728181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:58.9728279Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9728638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9728859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9729208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9729300Z kernel = self.compile( 2025-05-07T20:32:58.9729681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9729859Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9730033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9730038Z 2025-05-07T20:32:58.9730285Z self = 2025-05-07T20:32:58.9731071Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9731563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f313337bac0>} 2025-05-07T20:32:58.9732308Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9732499Z context = 2025-05-07T20:32:58.9732507Z 2025-05-07T20:32:58.9732681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9732943Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9733048Z module_map=module_map) 2025-05-07T20:32:58.9733219Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9733358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9733444Z E ^ 2025-05-07T20:32:58.9733806Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9733811Z 2025-05-07T20:32:58.9734220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9734225Z 2025-05-07T20:32:58.9734334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9734554Z self=, 2025-05-07T20:32:58.9734633Z T=1, 2025-05-07T20:32:58.9734716Z D=7168, 2025-05-07T20:32:58.9734800Z scale_ub=1200.0, 2025-05-07T20:32:58.9734888Z contiguous=False, 2025-05-07T20:32:58.9734972Z compiled=False, 2025-05-07T20:32:58.9735043Z ) 2025-05-07T20:32:58.9735263Z self = 2025-05-07T20:32:58.9735437Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.9735482Z 2025-05-07T20:32:58.9735555Z @given( 2025-05-07T20:32:58.9735681Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9735778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9735894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9736016Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9736133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9736214Z ) 2025-05-07T20:32:58.9736464Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9736561Z def test_silu_mul_quant( 2025-05-07T20:32:58.9736640Z self, 2025-05-07T20:32:58.9736713Z T: int, 2025-05-07T20:32:58.9736786Z D: int, 2025-05-07T20:32:58.9736890Z scale_ub: Optional[float], 2025-05-07T20:32:58.9736980Z contiguous: bool, 2025-05-07T20:32:58.9737062Z compiled: bool, 2025-05-07T20:32:58.9737145Z ) -> None: 2025-05-07T20:32:58.9737244Z torch.manual_seed(2025) 2025-05-07T20:32:58.9737313Z 2025-05-07T20:32:58.9737482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9737556Z 2025-05-07T20:32:58.9737656Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9737778Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9737870Z x = x_sign * x_clamp 2025-05-07T20:32:58.9737946Z x0 = x[:, :D] 2025-05-07T20:32:58.9738079Z x1 = x[:, D:] 2025-05-07T20:32:58.9738149Z 2025-05-07T20:32:58.9738235Z if contiguous: 2025-05-07T20:32:58.9738368Z x0 = x0.contiguous() 2025-05-07T20:32:58.9738456Z x1 = x1.contiguous() 2025-05-07T20:32:58.9738524Z 2025-05-07T20:32:58.9738617Z if scale_ub is not None: 2025-05-07T20:32:58.9738721Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9738857Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9738938Z ) 2025-05-07T20:32:58.9739016Z else: 2025-05-07T20:32:58.9739108Z scale_ub_tensor = None 2025-05-07T20:32:58.9739183Z 2025-05-07T20:32:58.9739312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9739398Z op = silu_mul_quant 2025-05-07T20:32:58.9739486Z if compiled: 2025-05-07T20:32:58.9739586Z op = torch.compile(op) 2025-05-07T20:32:58.9739696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9739868Z 2025-05-07T20:32:58.9739960Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9739965Z 2025-05-07T20:32:58.9740068Z moe/activation_test.py:117: 2025-05-07T20:32:58.9740194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9740295Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9740395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9740934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9741041Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9741398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9741617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9741960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9742057Z kernel = self.compile( 2025-05-07T20:32:58.9742446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9742627Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9742749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9742754Z 2025-05-07T20:32:58.9742963Z self = 2025-05-07T20:32:58.9743816Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9744316Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9c4c0>} 2025-05-07T20:32:58.9745059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9745244Z context = 2025-05-07T20:32:58.9745249Z 2025-05-07T20:32:58.9745421Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9745684Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9745794Z module_map=module_map) 2025-05-07T20:32:58.9745955Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9746049Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9746129Z E ^ 2025-05-07T20:32:58.9746478Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9746527Z 2025-05-07T20:32:58.9746980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9746985Z 2025-05-07T20:32:58.9747093Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9747312Z self=, 2025-05-07T20:32:58.9747387Z T=4096, 2025-05-07T20:32:58.9747460Z D=7168, 2025-05-07T20:32:58.9747545Z scale_ub=1200.0, 2025-05-07T20:32:58.9747638Z contiguous=False, 2025-05-07T20:32:58.9747716Z compiled=True, 2025-05-07T20:32:58.9747786Z ) 2025-05-07T20:32:58.9748031Z self = 2025-05-07T20:32:58.9748229Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:58.9748234Z 2025-05-07T20:32:58.9748304Z @given( 2025-05-07T20:32:58.9748428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9748532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9748652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9748768Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9748879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9748955Z ) 2025-05-07T20:32:58.9749197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9749332Z def test_silu_mul_quant( 2025-05-07T20:32:58.9749416Z self, 2025-05-07T20:32:58.9749487Z T: int, 2025-05-07T20:32:58.9749558Z D: int, 2025-05-07T20:32:58.9749657Z scale_ub: Optional[float], 2025-05-07T20:32:58.9749744Z contiguous: bool, 2025-05-07T20:32:58.9749825Z compiled: bool, 2025-05-07T20:32:58.9749908Z ) -> None: 2025-05-07T20:32:58.9750000Z torch.manual_seed(2025) 2025-05-07T20:32:58.9750070Z 2025-05-07T20:32:58.9750237Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9750312Z 2025-05-07T20:32:58.9750404Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9750530Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9750617Z x = x_sign * x_clamp 2025-05-07T20:32:58.9750695Z x0 = x[:, :D] 2025-05-07T20:32:58.9750771Z x1 = x[:, D:] 2025-05-07T20:32:58.9750842Z 2025-05-07T20:32:58.9750930Z if contiguous: 2025-05-07T20:32:58.9751027Z x0 = x0.contiguous() 2025-05-07T20:32:58.9751157Z x1 = x1.contiguous() 2025-05-07T20:32:58.9751233Z 2025-05-07T20:32:58.9751319Z if scale_ub is not None: 2025-05-07T20:32:58.9751428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9751562Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9751635Z ) 2025-05-07T20:32:58.9751714Z else: 2025-05-07T20:32:58.9751804Z scale_ub_tensor = None 2025-05-07T20:32:58.9751878Z 2025-05-07T20:32:58.9752008Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9752095Z op = silu_mul_quant 2025-05-07T20:32:58.9752182Z if compiled: 2025-05-07T20:32:58.9752286Z op = torch.compile(op) 2025-05-07T20:32:58.9752389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9752461Z 2025-05-07T20:32:58.9752558Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9752563Z 2025-05-07T20:32:58.9752662Z moe/activation_test.py:117: 2025-05-07T20:32:58.9752793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9752894Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9752992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9753361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9753453Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9753950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9754133Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9754493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9754719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9755061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9755158Z kernel = self.compile( 2025-05-07T20:32:58.9755548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9755721Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9755846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9755857Z 2025-05-07T20:32:58.9756061Z self = 2025-05-07T20:32:58.9756828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9757371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9d1b0>} 2025-05-07T20:32:58.9758122Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9758316Z context = 2025-05-07T20:32:58.9758320Z 2025-05-07T20:32:58.9758484Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9758749Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9758863Z module_map=module_map) 2025-05-07T20:32:58.9759020Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9759120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9759196Z E ^ 2025-05-07T20:32:58.9759548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9759592Z 2025-05-07T20:32:58.9760011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9760015Z 2025-05-07T20:32:58.9760115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9760331Z self=, 2025-05-07T20:32:58.9760413Z T=128, 2025-05-07T20:32:58.9760488Z D=7168, 2025-05-07T20:32:58.9760567Z scale_ub=1200.0, 2025-05-07T20:32:58.9760648Z contiguous=False, 2025-05-07T20:32:58.9760730Z compiled=True, 2025-05-07T20:32:58.9760802Z ) 2025-05-07T20:32:58.9761012Z self = 2025-05-07T20:32:58.9761180Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:58.9761184Z 2025-05-07T20:32:58.9761264Z @given( 2025-05-07T20:32:58.9761382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9761479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9761599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9761713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9761827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9761899Z ) 2025-05-07T20:32:58.9762146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9762285Z def test_silu_mul_quant( 2025-05-07T20:32:58.9762355Z self, 2025-05-07T20:32:58.9762427Z T: int, 2025-05-07T20:32:58.9762545Z D: int, 2025-05-07T20:32:58.9762647Z scale_ub: Optional[float], 2025-05-07T20:32:58.9762737Z contiguous: bool, 2025-05-07T20:32:58.9762823Z compiled: bool, 2025-05-07T20:32:58.9762900Z ) -> None: 2025-05-07T20:32:58.9762990Z torch.manual_seed(2025) 2025-05-07T20:32:58.9763063Z 2025-05-07T20:32:58.9763231Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9763312Z 2025-05-07T20:32:58.9763402Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9763525Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9763615Z x = x_sign * x_clamp 2025-05-07T20:32:58.9763693Z x0 = x[:, :D] 2025-05-07T20:32:58.9763772Z x1 = x[:, D:] 2025-05-07T20:32:58.9763842Z 2025-05-07T20:32:58.9763923Z if contiguous: 2025-05-07T20:32:58.9764016Z x0 = x0.contiguous() 2025-05-07T20:32:58.9764108Z x1 = x1.contiguous() 2025-05-07T20:32:58.9764177Z 2025-05-07T20:32:58.9764267Z if scale_ub is not None: 2025-05-07T20:32:58.9764375Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9764507Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9764585Z ) 2025-05-07T20:32:58.9764659Z else: 2025-05-07T20:32:58.9764799Z scale_ub_tensor = None 2025-05-07T20:32:58.9764876Z 2025-05-07T20:32:58.9765003Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9765089Z op = silu_mul_quant 2025-05-07T20:32:58.9765178Z if compiled: 2025-05-07T20:32:58.9765276Z op = torch.compile(op) 2025-05-07T20:32:58.9765378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9765448Z 2025-05-07T20:32:58.9765536Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9765544Z 2025-05-07T20:32:58.9765640Z moe/activation_test.py:117: 2025-05-07T20:32:58.9765771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9765871Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9765976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9766337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9766433Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9766941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9767081Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9767443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9767663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9768000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9768095Z kernel = self.compile( 2025-05-07T20:32:58.9768474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9768650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9768779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9768783Z 2025-05-07T20:32:58.9768988Z self = 2025-05-07T20:32:58.9769755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9770247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9c0d0>} 2025-05-07T20:32:58.9771079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9771274Z context = 2025-05-07T20:32:58.9771278Z 2025-05-07T20:32:58.9771445Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9771717Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9771823Z module_map=module_map) 2025-05-07T20:32:58.9771981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9772083Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9772160Z E ^ 2025-05-07T20:32:58.9772516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9772523Z 2025-05-07T20:32:58.9772946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9772951Z 2025-05-07T20:32:58.9773052Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9773275Z self=, 2025-05-07T20:32:58.9773410Z T=2048, 2025-05-07T20:32:58.9773487Z D=7168, 2025-05-07T20:32:58.9773570Z scale_ub=None, 2025-05-07T20:32:58.9773650Z contiguous=True, 2025-05-07T20:32:58.9773739Z compiled=True, 2025-05-07T20:32:58.9773810Z ) 2025-05-07T20:32:58.9774022Z self = 2025-05-07T20:32:58.9774196Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.9774201Z 2025-05-07T20:32:58.9774272Z @given( 2025-05-07T20:32:58.9774387Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9774486Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9774603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9774717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9774834Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9774908Z ) 2025-05-07T20:32:58.9775155Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9775291Z def test_silu_mul_quant( 2025-05-07T20:32:58.9775366Z self, 2025-05-07T20:32:58.9775444Z T: int, 2025-05-07T20:32:58.9775516Z D: int, 2025-05-07T20:32:58.9775613Z scale_ub: Optional[float], 2025-05-07T20:32:58.9775709Z contiguous: bool, 2025-05-07T20:32:58.9775792Z compiled: bool, 2025-05-07T20:32:58.9775867Z ) -> None: 2025-05-07T20:32:58.9775964Z torch.manual_seed(2025) 2025-05-07T20:32:58.9776038Z 2025-05-07T20:32:58.9776204Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9776281Z 2025-05-07T20:32:58.9776375Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9776501Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9776587Z x = x_sign * x_clamp 2025-05-07T20:32:58.9776665Z x0 = x[:, :D] 2025-05-07T20:32:58.9776750Z x1 = x[:, D:] 2025-05-07T20:32:58.9776822Z 2025-05-07T20:32:58.9776908Z if contiguous: 2025-05-07T20:32:58.9777006Z x0 = x0.contiguous() 2025-05-07T20:32:58.9777092Z x1 = x1.contiguous() 2025-05-07T20:32:58.9777163Z 2025-05-07T20:32:58.9777254Z if scale_ub is not None: 2025-05-07T20:32:58.9777358Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9777488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9777564Z ) 2025-05-07T20:32:58.9777638Z else: 2025-05-07T20:32:58.9777777Z scale_ub_tensor = None 2025-05-07T20:32:58.9777850Z 2025-05-07T20:32:58.9778018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9778114Z op = silu_mul_quant 2025-05-07T20:32:58.9778196Z if compiled: 2025-05-07T20:32:58.9778294Z op = torch.compile(op) 2025-05-07T20:32:58.9778402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9778472Z 2025-05-07T20:32:58.9778561Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9778568Z 2025-05-07T20:32:58.9778670Z moe/activation_test.py:117: 2025-05-07T20:32:58.9778794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9778895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9778996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9779357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:58.9779457Z return fn(*args, **kwargs) 
2025-05-07T20:32:58.9780042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9780140Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9780508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9780775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9781119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9781218Z kernel = self.compile( 2025-05-07T20:32:58.9781595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9781771Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9781894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9781902Z 2025-05-07T20:32:58.9782106Z self = 2025-05-07T20:32:58.9782884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9783380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132f9e560>} 2025-05-07T20:32:58.9784173Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9784359Z context = 2025-05-07T20:32:58.9784364Z 2025-05-07T20:32:58.9784537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9784802Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9784904Z module_map=module_map) 2025-05-07T20:32:58.9785070Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9785165Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9785241Z E ^ 2025-05-07T20:32:58.9785599Z E ValueError("type fp8e4nv not supported in this architecture. 
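Every CompilationError in this run has the same root cause: Triton only emits the fp8e4nv (OCP float8 e4m3) dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner reports (8, 6), where only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError says. Note that later examples with compiled=False fail identically, so torch.compile is not the trigger; silu_mul_quant launches the Triton kernel in eager mode too. A capability guard could skip these examples cleanly on pre-SM-8.9 parts; this is a minimal sketch, with helper and class names that are illustrative rather than FBGEMM's:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton fp8e4nv codegen needs compute capability >= 8.9; the A10G
        # in this job reports (8, 6), so this helper would return False here.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class SiluMulQuantTest(unittest.TestCase):  # hypothetical stand-in
        @unittest.skipIf(
            not supports_fp8e4nv(),
            "Triton fp8e4nv requires SM 8.9+ (Ada/Hopper)",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as in moe/activation_test.py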
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
self =
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB, 28.44 MiB free

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
-> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB, 140.44 MiB free

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB, 28.44 MiB free

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
-> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB, 28.44 MiB free
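The OutOfMemoryError examples look like a knock-on effect rather than an independent bug: by this point the process already holds roughly 21.6 GiB of the GPU's 22.07 GiB, so even allocations of a few hundred MiB or less at the very first lines of the test (activation_test.py:92-95) fail. Hypothesis replays all of its examples inside a single test invocation, so setUp/tearDown never run between examples and memory from earlier failed examples accumulates. One mitigation, sketched under that assumption (the helper is ours, not part of the test file), is to reclaim CUDA memory explicitly at the top of each example:

    import gc

    import torch

    def free_cuda_memory() -> None:
        # Hypothesis runs every example within one test call, so per-test
        # fixtures do not fire between examples; release dead tensors and
        # return cached blocks to the driver by hand.
        gc.collect()
        torch.cuda.empty_cache()

    # Inside test_silu_mul_quant, before the first allocation:
    #     free_cuda_memory()
    #     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)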
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
-> CompilationError at triton/compiler/compiler.py:100: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
-> same CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
-> same CompilationError
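The allocator hint repeated in each OutOfMemoryError above and below (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) only addresses fragmentation, not the accumulation itself, and it must be in the environment before the process makes its first CUDA allocation, so for this job it belongs in the workflow's env block rather than inside the test. A sketch of the in-process equivalent, assuming no CUDA work has happened yet at import time:

    import os

    # Must be set before the CUDA caching allocator initializes, i.e. before
    # the first tensor lands on a CUDA device anywhere in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  (imported after setting the env var on purpose)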
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9854470Z 2025-05-07T20:32:58.9854881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9854952Z 2025-05-07T20:32:58.9855089Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9855307Z self=, 2025-05-07T20:32:58.9855384Z T=2048, 2025-05-07T20:32:58.9855453Z D=7168, 2025-05-07T20:32:58.9855534Z scale_ub=1200.0, 2025-05-07T20:32:58.9855617Z contiguous=True, 2025-05-07T20:32:58.9855698Z compiled=False, 2025-05-07T20:32:58.9855777Z ) 2025-05-07T20:32:58.9855990Z self = 2025-05-07T20:32:58.9856162Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.9856166Z 2025-05-07T20:32:58.9856239Z @given( 2025-05-07T20:32:58.9856356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9856452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9856577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9856692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9856811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9856882Z ) 2025-05-07T20:32:58.9857124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9857217Z def test_silu_mul_quant( 2025-05-07T20:32:58.9857286Z self, 2025-05-07T20:32:58.9857402Z T: int, 2025-05-07T20:32:58.9857481Z D: int, 2025-05-07T20:32:58.9857577Z scale_ub: Optional[float], 2025-05-07T20:32:58.9857668Z contiguous: bool, 2025-05-07T20:32:58.9857755Z compiled: bool, 2025-05-07T20:32:58.9857830Z ) -> None: 2025-05-07T20:32:58.9857919Z torch.manual_seed(2025) 2025-05-07T20:32:58.9857992Z 2025-05-07T20:32:58.9858155Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9860099Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.9860164Z 2025-05-07T20:32:58.9860322Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.9860328Z 2025-05-07T20:32:58.9860471Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9860778Z self=, 2025-05-07T20:32:58.9860875Z T=1, 2025-05-07T20:32:58.9860978Z D=5120, 2025-05-07T20:32:58.9861088Z scale_ub=1200.0, 2025-05-07T20:32:58.9861202Z contiguous=True, 2025-05-07T20:32:58.9861317Z compiled=False, 2025-05-07T20:32:58.9861413Z ) 2025-05-07T20:32:58.9861720Z self = 2025-05-07T20:32:58.9861947Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.9861954Z 2025-05-07T20:32:58.9862052Z @given( 2025-05-07T20:32:58.9862214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9862348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9862505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9862665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9862822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9862917Z ) 2025-05-07T20:32:58.9863184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9863279Z def test_silu_mul_quant( 2025-05-07T20:32:58.9863351Z self, 2025-05-07T20:32:58.9863475Z T: int, 2025-05-07T20:32:58.9863549Z D: int, 2025-05-07T20:32:58.9863642Z scale_ub: Optional[float], 2025-05-07T20:32:58.9863772Z contiguous: bool, 2025-05-07T20:32:58.9863860Z compiled: bool, 2025-05-07T20:32:58.9863936Z ) -> None: 2025-05-07T20:32:58.9864029Z torch.manual_seed(2025) 2025-05-07T20:32:58.9864099Z 2025-05-07T20:32:58.9864264Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9864336Z 2025-05-07T20:32:58.9864430Z x_sign = torch.sign(x) 2025-05-07T20:32:58.9864552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.9864640Z x = x_sign * x_clamp 2025-05-07T20:32:58.9864716Z x0 = x[:, :D] 2025-05-07T20:32:58.9864791Z x1 = x[:, D:] 2025-05-07T20:32:58.9864868Z 2025-05-07T20:32:58.9864946Z if contiguous: 2025-05-07T20:32:58.9865034Z x0 = x0.contiguous() 2025-05-07T20:32:58.9865122Z x1 = x1.contiguous() 2025-05-07T20:32:58.9865198Z 2025-05-07T20:32:58.9865288Z if scale_ub is not None: 2025-05-07T20:32:58.9865397Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.9865528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.9865602Z ) 2025-05-07T20:32:58.9865675Z else: 2025-05-07T20:32:58.9865767Z scale_ub_tensor = None 2025-05-07T20:32:58.9865843Z 2025-05-07T20:32:58.9866011Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.9866102Z op = silu_mul_quant 2025-05-07T20:32:58.9866189Z if compiled: 2025-05-07T20:32:58.9866285Z op = torch.compile(op) 2025-05-07T20:32:58.9866388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9866460Z 2025-05-07T20:32:58.9866546Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.9866550Z 2025-05-07T20:32:58.9866647Z moe/activation_test.py:117: 2025-05-07T20:32:58.9866776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9866872Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.9866974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.9867466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.9867561Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.9867922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.9868220Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.9868582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.9868675Z kernel = self.compile( 2025-05-07T20:32:58.9869052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.9869227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.9869356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.9869361Z 2025-05-07T20:32:58.9869563Z self = 2025-05-07T20:32:58.9870341Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.9870841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3132d5e200>} 2025-05-07T20:32:58.9871585Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.9871816Z context = 2025-05-07T20:32:58.9871858Z 2025-05-07T20:32:58.9872028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.9872285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.9872395Z module_map=module_map) 2025-05-07T20:32:58.9872563Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.9872662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.9872734Z E ^ 2025-05-07T20:32:58.9873088Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.9873093Z 2025-05-07T20:32:58.9873502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.9873510Z 2025-05-07T20:32:58.9873616Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9873838Z self=, 2025-05-07T20:32:58.9873913Z T=2048, 2025-05-07T20:32:58.9873991Z D=5120, 2025-05-07T20:32:58.9874067Z scale_ub=None, 2025-05-07T20:32:58.9874149Z contiguous=True, 2025-05-07T20:32:58.9874232Z compiled=False, 2025-05-07T20:32:58.9874302Z ) 2025-05-07T20:32:58.9874560Z self = 2025-05-07T20:32:58.9874735Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.9874739Z 2025-05-07T20:32:58.9874814Z @given( 2025-05-07T20:32:58.9874930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9875025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9875138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9875256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9875368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9875440Z ) 2025-05-07T20:32:58.9875698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9875789Z def test_silu_mul_quant( 2025-05-07T20:32:58.9875862Z self, 2025-05-07T20:32:58.9875933Z T: int, 2025-05-07T20:32:58.9876003Z D: int, 2025-05-07T20:32:58.9876108Z scale_ub: Optional[float], 2025-05-07T20:32:58.9876194Z contiguous: bool, 2025-05-07T20:32:58.9876322Z compiled: bool, 2025-05-07T20:32:58.9876401Z ) -> None: 2025-05-07T20:32:58.9876490Z torch.manual_seed(2025) 2025-05-07T20:32:58.9876561Z 2025-05-07T20:32:58.9876727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9876797Z 2025-05-07T20:32:58.9876888Z > x_sign = torch.sign(x) 2025-05-07T20:32:58.9878717Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.9878727Z 2025-05-07T20:32:58.9878845Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:58.9878853Z 2025-05-07T20:32:58.9878953Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9879171Z self=, 2025-05-07T20:32:58.9879248Z T=16384, 2025-05-07T20:32:58.9879320Z D=5120, 2025-05-07T20:32:58.9879398Z scale_ub=None, 2025-05-07T20:32:58.9879480Z contiguous=True, 2025-05-07T20:32:58.9879605Z compiled=False, 2025-05-07T20:32:58.9879677Z ) 2025-05-07T20:32:58.9879930Z self = 2025-05-07T20:32:58.9880105Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.9880109Z 2025-05-07T20:32:58.9880191Z @given( 2025-05-07T20:32:58.9880304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9880399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9880515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9880630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9880743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9880816Z ) 2025-05-07T20:32:58.9881058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9881150Z def test_silu_mul_quant( 2025-05-07T20:32:58.9881229Z self, 2025-05-07T20:32:58.9881303Z T: int, 2025-05-07T20:32:58.9881378Z D: int, 2025-05-07T20:32:58.9881477Z scale_ub: Optional[float], 2025-05-07T20:32:58.9881566Z contiguous: bool, 2025-05-07T20:32:58.9881651Z compiled: bool, 2025-05-07T20:32:58.9881726Z ) -> None: 2025-05-07T20:32:58.9881816Z torch.manual_seed(2025) 2025-05-07T20:32:58.9881886Z 2025-05-07T20:32:58.9882049Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9883856Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.9883873Z 2025-05-07T20:32:58.9883988Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.9883995Z 2025-05-07T20:32:58.9884093Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9884312Z self=, 2025-05-07T20:32:58.9884385Z T=4096, 2025-05-07T20:32:58.9884455Z D=5120, 2025-05-07T20:32:58.9884538Z scale_ub=None, 2025-05-07T20:32:58.9884620Z contiguous=True, 2025-05-07T20:32:58.9884769Z compiled=False, 2025-05-07T20:32:58.9884841Z ) 2025-05-07T20:32:58.9885050Z self = 2025-05-07T20:32:58.9885223Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:58.9885227Z 2025-05-07T20:32:58.9885302Z @given( 2025-05-07T20:32:58.9885414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9885514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9885627Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9885741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9885855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9885925Z ) 2025-05-07T20:32:58.9886171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9886261Z def test_silu_mul_quant( 2025-05-07T20:32:58.9886336Z self, 2025-05-07T20:32:58.9886418Z T: int, 2025-05-07T20:32:58.9886490Z D: int, 2025-05-07T20:32:58.9886584Z scale_ub: Optional[float], 2025-05-07T20:32:58.9886675Z contiguous: bool, 2025-05-07T20:32:58.9886755Z compiled: bool, 2025-05-07T20:32:58.9886828Z ) -> None: 2025-05-07T20:32:58.9886921Z torch.manual_seed(2025) 2025-05-07T20:32:58.9886989Z 2025-05-07T20:32:58.9887155Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9889004Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:58.9889013Z 2025-05-07T20:32:58.9889129Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:58.9889137Z 2025-05-07T20:32:58.9889234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.9889448Z self=, 2025-05-07T20:32:58.9889524Z T=2048, 2025-05-07T20:32:58.9889592Z D=5120, 2025-05-07T20:32:58.9889670Z scale_ub=None, 2025-05-07T20:32:58.9889765Z contiguous=False, 2025-05-07T20:32:58.9890082Z compiled=False, 2025-05-07T20:32:58.9890194Z ) 2025-05-07T20:32:58.9890458Z self = 2025-05-07T20:32:58.9890628Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.9890632Z 2025-05-07T20:32:58.9890706Z @given( 2025-05-07T20:32:58.9890912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.9891009Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.9891130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.9891241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.9891352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.9891432Z ) 2025-05-07T20:32:58.9891678Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.9891771Z def test_silu_mul_quant( 2025-05-07T20:32:58.9891851Z self, 2025-05-07T20:32:58.9891922Z T: int, 2025-05-07T20:32:58.9891994Z D: int, 2025-05-07T20:32:58.9892094Z scale_ub: Optional[float], 2025-05-07T20:32:58.9892179Z contiguous: bool, 2025-05-07T20:32:58.9892264Z compiled: bool, 2025-05-07T20:32:58.9892340Z ) -> None: 2025-05-07T20:32:58.9892430Z torch.manual_seed(2025) 2025-05-07T20:32:58.9892504Z 2025-05-07T20:32:58.9892670Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.9894509Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
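The "Tried to allocate" sizes track the input tensor exactly: T x 2D bfloat16 elements at 2 bytes each, so each failure above is the very first allocation of its example, not hidden overhead. A quick arithmetic check (plain Python, helper name ours):

    def randn_mib(T: int, D: int) -> float:
        # The test allocates a [T, 2*D] bfloat16 tensor; bfloat16 is 2 bytes.
        return T * 2 * D * 2 / 2**20

    assert randn_mib(4096, 5120) == 80.0    # the 80.00 MiB failure above
    assert randn_mib(2048, 5120) == 40.0    # the 40.00 MiB failures
    assert randn_mib(4096, 7168) == 112.0   # the 112.00 MiB failures
    assert randn_mib(16384, 7168) == 448.0  # the 448.00 MiB failures below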
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
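Every example above fails with only about 26 MiB free on a 22 GiB device, so the shortage is accumulated state across Hypothesis examples rather than any single allocation being too large. A hedged sketch of one mitigation, releasing cached memory between examples (the helper name is ours, not from the test file):

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop Python references left over from prior examples
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver

Calling such a helper at the top of the test body would keep one example's tensors from starving the next.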
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3133089ea0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
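This CompilationError is an architecture limit rather than a memory problem: Triton's fp8e4nv type (float8_e4m3fn) is, to our knowledge, only available on SM 8.9+ GPUs, while this job's g5 runner carries an A10G (SM 8.6) that exposes only fp8e4b15 and fp8e5, exactly as the ValueError lists. A hedged sketch of a capability guard for skipping such cases (the helper name is ours):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to float8_e4m3fn, available on Ada (SM 8.9) and
        # Hopper (SM 9.0) but not on the A10G (SM 8.6) in this job.
        return torch.cuda.get_device_capability() >= (8, 9)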
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
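By this point even 20 MiB allocations fail with 21.77 GiB already held by PyTorch, so each new example dies at whichever line allocates first. A guard that fails fast with a more diagnostic message, sketched around torch.cuda.mem_get_info (which returns free and total bytes for the current device; the helper name and threshold are ours):

    import torch

    def require_free_cuda(min_free: int = 1 << 30) -> None:
        free, total = torch.cuda.mem_get_info()
        if free < min_free:
            raise RuntimeError(
                f"only {free / 2**20:.0f} MiB of {total / 2**30:.2f} GiB free; "
                "earlier examples are likely still holding allocations"
            )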
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
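The deprecation warning refers to keyword arguments of triton.autotune. A sketch of the decorator written without the deprecated warmup/rep/use_cuda_graph knobs (the kernel and its config values are illustrative only, not FBGEMM's):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 128}, num_warps=4),
            triton.Config({"BLOCK": 256}, num_warps=8),
        ],
        key=["n"],  # retune when the problem size changes
    )
    @triton.jit
    def _double_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)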
See " 2025-05-07T20:32:58.9985778Z 2025-05-07T20:32:58.9985995Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:58.9986201Z ================= 1 failed, 1 deselected, 3 warnings in 17.41s ================= 2025-05-07T20:33:00.5501814Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:00.6128138Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:00.6128404Z 2025-05-07T20:33:00.6128575Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:00.6129145Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:00.6129573Z 2025-05-07T20:33:00.6129578Z 2025-05-07T20:33:00.6129582Z 2025-05-07T20:33:00.6145500Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:00.6225477Z Post job cleanup. 2025-05-07T20:33:00.7218628Z [command]/usr/bin/git version 2025-05-07T20:33:00.7263394Z git version 2.47.1 2025-05-07T20:33:00.7302206Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/0dcfceed-1031-4d6a-9b2c-6229e635b8d3/.gitconfig' 2025-05-07T20:33:00.7313022Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/0dcfceed-1031-4d6a-9b2c-6229e635b8d3' before making global git config changes 2025-05-07T20:33:00.7313890Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:00.7318432Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:00.7370502Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:00.7405322Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:00.7745575Z Entering 'external/asmjit' 2025-05-07T20:33:00.7812472Z Entering 'external/composable_kernel' 2025-05-07T20:33:00.7885813Z Entering 'external/cpuinfo' 2025-05-07T20:33:00.7953612Z Entering 'external/cutlass' 2025-05-07T20:33:00.8029624Z Entering 'external/googletest' 2025-05-07T20:33:00.8096246Z Entering 'external/hipify_torch' 2025-05-07T20:33:00.8164975Z Entering 'external/json' 2025-05-07T20:33:00.8251299Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:00.8274094Z http.https://github.com/.extraheader 2025-05-07T20:33:00.8284436Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:00.8315235Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:00.8651423Z Entering 'external/asmjit' 2025-05-07T20:33:00.8694138Z http.https://github.com/.extraheader 2025-05-07T20:33:00.8737922Z Entering 'external/composable_kernel' 2025-05-07T20:33:00.8780088Z http.https://github.com/.extraheader 2025-05-07T20:33:00.8829554Z Entering 'external/cpuinfo' 2025-05-07T20:33:00.8871978Z http.https://github.com/.extraheader 2025-05-07T20:33:00.8915512Z Entering 'external/cutlass' 2025-05-07T20:33:00.8957470Z http.https://github.com/.extraheader 2025-05-07T20:33:00.9008639Z 
2025-05-07T20:33:00.6225477Z Post job cleanup.
2025-05-07T20:33:00.7218628Z [command]/usr/bin/git version
2025-05-07T20:33:00.7263394Z git version 2.47.1
2025-05-07T20:33:00.7302206Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/0dcfceed-1031-4d6a-9b2c-6229e635b8d3/.gitconfig'
2025-05-07T20:33:00.7313022Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/0dcfceed-1031-4d6a-9b2c-6229e635b8d3' before making global git config changes
2025-05-07T20:33:00.7313890Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:33:00.7318432Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:33:00.7370502Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:33:00.7405322Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:33:00.7745575Z Entering 'external/asmjit'
2025-05-07T20:33:00.7812472Z Entering 'external/composable_kernel'
2025-05-07T20:33:00.7885813Z Entering 'external/cpuinfo'
2025-05-07T20:33:00.7953612Z Entering 'external/cutlass'
2025-05-07T20:33:00.8029624Z Entering 'external/googletest'
2025-05-07T20:33:00.8096246Z Entering 'external/hipify_torch'
2025-05-07T20:33:00.8164975Z Entering 'external/json'
2025-05-07T20:33:00.8251299Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:33:00.8274094Z http.https://github.com/.extraheader
2025-05-07T20:33:00.8284436Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2025-05-07T20:33:00.8315235Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:33:00.8651423Z Entering 'external/asmjit'
2025-05-07T20:33:00.8694138Z http.https://github.com/.extraheader
2025-05-07T20:33:00.8737922Z Entering 'external/composable_kernel'
2025-05-07T20:33:00.8780088Z http.https://github.com/.extraheader
2025-05-07T20:33:00.8829554Z Entering 'external/cpuinfo'
2025-05-07T20:33:00.8871978Z http.https://github.com/.extraheader
2025-05-07T20:33:00.8915512Z Entering 'external/cutlass'
2025-05-07T20:33:00.8957470Z http.https://github.com/.extraheader
2025-05-07T20:33:00.9008639Z Entering 'external/googletest'
2025-05-07T20:33:00.9050590Z http.https://github.com/.extraheader
2025-05-07T20:33:00.9092930Z Entering 'external/hipify_torch'
2025-05-07T20:33:00.9135263Z http.https://github.com/.extraheader
2025-05-07T20:33:00.9176997Z Entering 'external/json'
2025-05-07T20:33:00.9220652Z http.https://github.com/.extraheader
2025-05-07T20:33:00.9369312Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:33:00.9404581Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:33:00.9415033Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:33:00.9415393Z ##[endgroup]
2025-05-07T20:33:00.9514629Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:33:11.7054356Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:28.0591910Z Cleaning up orphan processes